ABSTRACT

In  this  thesis  a  new  type  of  information  retrieval  system  is 
suggested  which  utilizes  data  of  the  type  generated  by  the  users  of  the 
system  instead  of  data  generated  by  indexers. 

The theoretical model on which the system is based consists of
three basic elements. The first element is a measure of the relatedness
between document-pairs. It is derived from information theory.
The second element is a definition of what constitutes a set (cluster)
of inter-related documents. This definition is based on the measure of
relatedness. The last element is a procedure which transforms a request
for information into a cluster of answer documents.

Requests  are  made  by  designating  one  or  more  documents  to  be  of 
interest  and  perhaps  some  to  be  of  no  interest.  The  requestor  can 
continue  to  interact  with  the  procedure  as  it  locates  the  answer  cluster 
by  specifying  as  interesting  or  not  interesting  other  documents  which 
are presented to him. The answer cluster which is generated is
automatically made as small (specific) or as large (general) as is desired,
depending  on  the  initial  request  and  the  subsequent  interactions. 

An  experimental  system  was  developed  to  test  the  model  in  a 
realistic environment. It was programmed for the Project MAC
time-sharing system and utilized the physics data file of the Technical
Information  Project.  Citations  were  used  as  the  data  base  for  the 
measure  of  relatedness.  A  file  structure  and  retrieval  language  were 
designed  which  allowed  close  man-machine  coupling. 

Experiments were conducted which compared the clusters of documents
produced by the experimental system with various sets of documents
of known mutual pertinence. These sets included bibliographies from
review articles, subject categories, and sets of documents found to be
of interest to selected users of the system. It was found that between
60 and 90% of the documents of known pertinence were included in the
corresponding clusters. Ways of improving this retrieval efficiency
even further are suggested.

Thesis Supervisor: Robert M. Fano
Title:  Ford  Professor  of  Engineering 
PART ONE: INTRODUCTION

This thesis is divided into four parts. In
this part we introduce the project by describing
results of related work and by discussing the
objectives of the research. In Part Two the
theoretical model on which the project is based
is presented. Part Three contains a description
of the experimental system which was developed to
test the model. In the final part we present the
experimental results and the conclusions about the
theoretical model that can be drawn from them.
CHAPTER I
BACKGROUND 


1.1  Introduction 

In a pioneering article written at the close of World War II, Dr.
Vannevar Bush, Director of the Office of Scientific Research and
Development, called on scientists to redirect their energies to creating
"a new relationship between thinking man and the sum of our knowledge."
He noted that "our methods of transmitting and reviewing the results of
research are generations old and by now are totally inadequate."[10]

His  challenge  to  mechanize  and  streamline  the  library  process  has 
been accepted by numerous groups in the intervening twenty years. A
large  number  of  devices  have  been  developed  which  mechanically  or 
electronically  select  information  from  a  store.  Methods  of  automatically 
indexing,  classifying,  and  abstracting  documents  have  been  devised.  A 
myriad  of  other  disciplines  have  been  called  in  for  assistance. 

Before attempting to review and evaluate this activity, it is
extremely important that the implied "inadequacies" of traditional
library methods be clearly defined. Only then can one hope to determine
the effectiveness of any given approach in resolving these problems.

1.2 Areas Needing Improvement

Six general aspects of library systems have been chosen as important
areas which need improvement and which appear to be amenable to
improvement through some type of mechanization. Most information
storage and retrieval projects have had as their stated or implied goals
one or more of these objectives.

1.21  Closer  Man-System  Coupling 

In many cases a user who comes to an information system cannot
state precisely what he wants. He has a very real need for information,
but he cannot define verbally exactly what that need is. In other
cases  a  user  can  accurately  specify  his  interests  but  changes  his  mind 
as  to  what  he  wants  when  he  finds  that  there  are  too  many  or  too  few 
articles  which  satisfy  the  request. 

Unfortunately  most  systems  (automatic  and  manual)  are  designed  for 
that  rare  individual  who  knows  exactly  what  he  wants  and  what  the  stack 
contains. In these systems there is a clear demarcation between request
specification by the user and answer presentation by the system.

A much closer coupling of man and system is generally needed so
that each can contribute to the best of his (its) ability at each step
in the search. For example, the system might help the user in formulating
the request by noting with each change in the request the probable number
of documents in the final answer, by presenting representative documents
for evaluation, and by ranking the output according to degree of
relatedness. The user, on the other hand, could help the system find the
desired answer by catching and correcting possible misunderstandings of
the request as early in the search as possible, by narrowing or broadening
the request if the size of the expected answer becomes too large or too
small, and by continually refining the request based on the information
supplied by the system.


1.22 More Flexibility in Requests

Even if it is assumed that a user can adequately specify his
interests, there is still the difficulty of matching his request
vocabulary with the vocabulary of the indexer. Perhaps the user is looking
for books on "information retrieval" but fails to realize that the
classifier posted such books under "documentation". Of course, the
classifier may have foreseen this difficulty and placed a "see" card
under information retrieval. However, this does not always occur.

Another basic problem is faced by the person who knows a given
paper or a given author of interest but is forced to translate this
knowledge into a set of descriptors instead of being able to feed it
in directly as a request.

More flexibility is needed in the allowable vocabulary, language
structure, and type of information which can be specified in a request.

1.23 Physical Barriers

The mere physical separation of the user from the library presents
a barrier that has a greater impact than we may realize. This is also
true of the separation of the card file from the stacks. Evidence of
the importance of this factor is found in the popularity of small
special collections distributed throughout a large organization and in
the personal libraries maintained by most research workers.

There is also the time barrier. If a person could get an answer to
his problem in five minutes, he might be interested, whereas he might
decide to bypass the problem if it takes one-half hour or more. A
third barrier is cost. This factor is not a direct consideration to the
user in most cases because no direct fee is levied for use of a library.
1.24 Quality of Selection Information

All libraries provide the user with certain types of information
which help him to select from the total store those books which are of
interest to him without having to scan the text of each book. Even
those libraries which cater to the browser generally arrange books by
content on the shelves and place the spine out so that the title and
author can be seen at a glance.

There  are  at  least  three  important  factors  which  must  be  considered 
in  the  generation  of  selection  information  for  a  given  document. 

1.  The  actual  contents  of  the  document. 

2.  The  collection  in  which  the  document  will  reside. 

3. The needs and characteristics of the user population
serviced  by  the  collection. 

If the only factor to be considered in indexing were the contents
of the document, then a valid method for indexing would be to have each
author, as the final authority on what the document contains, index it.
However, libraries have found that the other two factors are also
important and that an author cannot be expected to be familiar with
each library and each user population that might have his book or
article.

The approach used by conventional libraries is to rely on an
indexer or classifier to generate the selection information needed.
This type of individual is usually an expert on the contents of the
library collection, but knows much less about the first and third
factors. He usually has about 10-15 minutes' time to determine what
the author of the document has said and predict the types of users this
information will be of interest to (through the categories selected);
all this with little direct involvement in the field or area in question.
The amazing part about the whole process is that an indexer can
sometimes come up with a sketchy, but fairly useful portrayal of the
document.

An  additional  problem  is  that  much  of  the  literature  (periodicals, 
technical  reports,  etc.)  never  even  receives  the  attention  of  an  indexer. 

1.25 Restrictive Classification Model

Even if the classifier were able to determine the exact contents of
a document, he would still find difficulty in fitting his findings into
the rigid classification systems currently in use (Dewey Decimal,
Library of Congress, etc.).

First, the classifier is allowed only a yes-no type of response.
Either the document is placed in a given category or it is not--there is
no middle ground, no partial relationship.

Next there is the "broken relationship" problem inherent in
hierarchal classification structures. No matter where a category is placed
in the hierarchy tree, there are related fields to which it cannot be
adjacent. For example, if the history of physics is placed in the
science area, it loses its connection to history and vice-versa. This
problem is only partially alleviated by the "see" and "see also"
artifices.

Third, there is the difficulty encountered in changing a
classification structure to fit with our current body of knowledge. This
involves considerable expansion and contraction of areas along with
insertion of entirely new fields and the deletion of obsolete ones. The
old classification framework eventually becomes so strained in certain
areas that there is danger of collapse.

Each of these difficulties encountered in the classification of
documents generates a corresponding difficulty for the user. V. Bush
described the use of a classification system in this way:

"...information is found (when it is) by tracing it down
from subclass to subclass. It can be in only one place,
unless duplicates are used; one has to have rules as to which
path will locate it, and the rules are cumbersome. Having
found one item, moreover, one has to emerge from the system
and re-enter on a new path."[10]

1.26 Need for Dynamic Indexing

Consideration of the problem of indexing leads one to the
conclusion that there is no intrinsic content to a document which, when
once properly characterized by an appropriate set of words or phrases,
is then adequately indexed for all situations and all users. In reality
the depth and type of indexing needed depends both on the
characteristics of the collection in which the document is imbedded and
on the interests of the user population to be serviced by the collection
at the time.

Once this point is conceded then it becomes apparent that the way
a document is indexed must change as the collection and user population
vary. One of the major drawbacks of conventional indexing methods is
that in practice they are static. A document, once indexed, is almost
never re-indexed. Indeed some people believe that a properly indexed
document should never need re-indexing. R. A. Fairthorne claims the
following:

"We have to assume that a classifier can decide that a
text is relevant to a topic in such a way that, apart from
blunders, neither future development nor decisions elsewhere
shall compel revision. Future developments certainly should
not upset any decision about relevance; if an item is relevant
to some topic, it will always be relevant, though the relevance
may become unimportant and new relevancies may be added."

The case for dynamic indexing was clearly presented by M. M.
Kessler:

"Indexing must be fluid and dynamic, reflecting the
changing needs of society and the contributions of new insights.
It is most unlikely that anybody, be he expert scientist or
expert indexer, can read a given paper at a given time and see
enough of its implications to classify it once and for all. If
this philosophy of classification were accepted, as it now is,
the resulting system would impose such a rigidity upon the flow
of information that the working scientist would be forced to
..."[26]
1.3 Evaluation of Previous Efforts

It would be impossible to describe all of the work which has been
undertaken in the field of information retrieval and documentation in
the last 20 years. What will be attempted here is an analysis of
certain representative efforts in each of six broad areas.


1.31  Hardware  Developments 

Many interesting machines have been developed for use in
information processing (Rapid Selector, Peek-a-boo, Zator, Walnut,
Minicard, general purpose computers, etc.). Instead of discussing the
specific capabilities of these machines, let us note some of the general
trends in hardware development which promise to have the greatest impact
on information retrieval.


The first would be the development of multiply-accessed
(time-sharing) computers. A research worker with a connection to such a
computer would be able to query a large central store of information
directly from his office, laboratory, or home and receive an almost
immediate response. This is in contrast to the batch-processing
computer which processes requests in groups at a central location and
usually involves delays in response of from several hours to several
days. A brief description of a particular time-sharing system (the one
used by this research project) can be found in Sec. 6.1.

A system of users interacting with a large central information
store through a time-shared computer offers another important capability
that might be overlooked. Not only can the user obtain information
from the system, but the system can also monitor the user. This
monitored usage data could be collected at little or no inconvenience to
the user. It would complete the information loop with feedback from
the user continually modifying and improving system performance.

Another significant hardware advancement is the development of
larger and larger mass memories. It is estimated that all of the
textual information in the 20 million documents in the Library of Congress
could be stored in a 10 trillion-bit (10^13) memory. Current random
access devices store 10 - 10 bits, while large magnetic tape
installations have a capacity of 10^ bits. Random access storage devices
have been announced in the 10^12 bit range. It would appear that continued
progress may soon eliminate storage capacity as a limiting factor in
the mechanization of large information retrieval systems.
A parameter closely related to memory size is access time.
Typical access times to any part of a 10^9-bit file on a random access
disc are currently 100 ms. The real problem is in knowing which part
of the file to read. Perhaps associative memories, complete file
inversion, or some other artifice will resolve this problem.

1.32  Indexing  Methods  and  Models 

As important as hardware developments are, V. Bush pointed out an
even more basic problem:

"The real heart of the matter of selection, however,
goes deeper than a lag in the adoption of mechanisms by
libraries, or a lack of development of devices for their
use. Our ineptitude in getting at the record is largely
caused by the artificiality of systems of indexing."[10]

The 'systems of indexing' to which Bush referred are, of course,
the traditional subject catalog and classification schemes still in use
(Universal Decimal, Library of Congress, etc.). Some of the drawbacks
of these classification systems were discussed in Section 1.25.

Beginning about 1950 efforts were made to replace these
conventional classification methods. One result was coordinate indexing.
In coordinate indexing documents are assigned Uniterms or descriptors
(usually single words). These descriptors are given no hierarchal or
other structure. A request consists of certain descriptors connected
by the logical and-or-not operations.

Coordinate indexing eliminated many of the difficulties encountered
in hierarchal classifications and subject catalogs. However, its
strength was also its shortcoming. The elimination of all order and
structure from the descriptors introduced many 'false drops'. For
example, a hypothetical user looking for papers on the causes of
blindness in Venice might also retrieve articles on the design of Venetian
blinds. To reintroduce that which was lost by eliminating descriptor
context and order, such features as role indicators were used.
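
The false-drop problem can be illustrated with a short sketch. The
following Python fragment is only an illustration under stated
assumptions: the three documents, their descriptors, and the function
name search are hypothetical and are not taken from any system discussed
here. Descriptors are held as unordered sets, and a request is an
and-or-not combination of descriptors.

```python
# Hypothetical toy index: document id -> unordered set of descriptors.
# The documents and descriptors are invented for illustration only.
INDEX = {
    "D1": {"blind", "venice", "cause"},    # causes of blindness in Venice
    "D2": {"blind", "venice", "design"},   # design of Venetian blinds
    "D3": {"blind", "cause", "nutrition"}, # unrelated to Venice
}


def search(index, all_of=(), any_of=(), none_of=()):
    """Return the sorted ids of documents satisfying an and-or-not request."""
    hits = []
    for doc_id, descriptors in index.items():
        if not set(all_of) <= descriptors:
            continue  # an AND term is missing
        if any_of and not (set(any_of) & descriptors):
            continue  # none of the OR terms is present
        if set(none_of) & descriptors:
            continue  # a NOT term is present
        hits.append(doc_id)
    return sorted(hits)


# Request: blind AND venice. D2 is a false drop.
print(search(INDEX, all_of=["blind", "venice"]))
# Request: blind AND venice AND NOT design.
print(search(INDEX, all_of=["blind", "venice"], none_of=["design"]))
```

The first request retrieves D2 as a false drop because an unordered
descriptor set cannot distinguish blindness in Venice from Venetian
blinds. The second request removes it only because the requestor
happened to anticipate the unwanted descriptor.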

Currently some workers in the field seem to be disenchanted with
coordinate indexing and have shifted reluctantly back to the conventional
classification methods.

Another  field  of  endeavor  was  in  the  modeling  area.  A  number  of 
models  were  proposed  which  described  the  indexing  and  retrieval  functions. 
Unfortunately  that  was  all  that  these  models  did  -  they  provided  an 
alternate way of describing an already familiar problem. No new insights
were  gained  and  no  helpful  procedures  resulted. 

1.33 New Bases for Selection Information

It has already been noted that all library systems depend on
selection information (classification categories, subject headings,
author indexes, etc.) to locate documents relevant to a particular
request. Customary library practice is to depend on the indexer to
produce this information. Section 1.24 outlines some of the
difficulties inherent to this dependence.

Studies during the past eight years have been undertaken to see if
selection information generated by indexers can be supplemented and
perhaps replaced by that generated by the automatic processing of a
document's contents.

At first simple methods of exploiting the information found in a
document were tried. Permuted title indexes and citation indexes met
with some success. In 1958 Luhn proposed automatic abstracting.
This consisted of the selection of certain words as the keywords of a
document based on their frequencies of occurrence. The sentences and/or
phrases which contained these words were then extracted to form the
auto-abstract of the document. The idea was then extended by Maron in
1961 to the automatic indexing of documents with the keywords extracted
becoming the descriptors.
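
The frequency-based procedure just described can be sketched in a few
lines. This is a deliberately simplified illustration, not a
reproduction of the published method: the stop-word list, the function
names keywords and auto_abstract, and the scoring rule (count of keyword
occurrences per sentence) are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed stop-word list; a real system would use a much longer one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "are", "was"}


def keywords(text, top_n=3):
    """Pick the top_n most frequent non-stop words as keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def auto_abstract(text, top_n=3, max_sentences=2):
    """Extract the sentences that contain the most keyword occurrences."""
    keys = set(keywords(text, top_n))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for position, sentence in enumerate(sentences):
        score = sum(1 for w in re.findall(r"[a-z]+", sentence.lower()) if w in keys)
        if score > 0:
            scored.append((score, position, sentence))
    # Keep the best-scoring sentences, then restore document order.
    best = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]
    return [sentence for _, _, sentence in sorted(best, key=lambda t: t[1])]


SAMPLE = ("Citation indexes link papers. "
          "Citation data avoid synonym problems. "
          "The weather was mild.")
print(keywords(SAMPLE, top_n=1))
print(auto_abstract(SAMPLE, top_n=1))
```

Using the extracted keywords directly as descriptors gives the automatic
indexing variant. Note that the sketch ignores word order, context,
syntax, and synonyms entirely.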

Automatic indexing was about 50% successful in assigning documents
to the same categories that the human indexer did. This mediocre
showing can be attributed to the fact that machine indexing did not
make use of the order, context, syntax and synonyms of the words
extracted. This in essence is the same difficulty found in coordinate
indexing. Some of the subsequent efforts at automatic indexing
attempted to account for syntax, but this trail encountered the same
massive obstacles that had already slowed progress in automatic language
translation.

Thus after some initial success, the automatic generation of
selection information based on document contents ran aground. One
cannot dispute the fact that a description of the subject covered by
the article is contained within the article. Just how one can capitalize
on that knowledge is the problem. The needed information is there, but
machines and indexers currently can extract only a part of it.

There is one notable exception to the above comments. The
citations found in articles do not have the same type of synonym and
syntax problems that textual material does. Thus selection information
generated from citations has had considerable success for those bodies
of literature which have a good citation base.[20]

A discussion of the user of a library as a source of selection
information will be postponed until Chapter II, since little, if any,
prior experimental work has been done in this area.

1.34 Measures of Relevance

In conventional library systems documents are assigned to
categories and subject headings on a yes-no sort of basis. Either the
document is in the category or it is not--there is no middle ground.
The restrictive nature of this type of arrangement was pointed out by
Maron and Kuhns in 1960. They proposed that an 8-value weighted
indexing scheme be used to represent the degree to which a document is
related to a term.

This idea was extended to thesauri by Stiles in 1961. A
traditional thesaurus allows terms to be listed as synonyms or antonyms but
the degree of synonymity is left unspecified. Stiles proposed an
association factor to represent the amount of synonymity between terms.

Numerous  other  'measures  of  relevance'  between  the  various 
entities  of  libraries  have  been  proposed  since.  Some  of  the  better 
known  of  these  measures  are  tabulated  in  Appendix  A.  Unfortunately, 
there  appears  to  be  considerable  confusion  over  exactly  what  these 
measures  represent,  and  the  use  of  the  term  'relevance'  would  seem  to 
add  to  this  confusion. 

Many documentalists now speak with some assurance about the amount
(to 3 or 4 significant figures) of 'relevance' of a document to a
category or to a request. The 'relevance ratio' is an accepted way to
measure information retrieval system efficiency. All too often these
comments leave one with the impression that there is some intrinsic
meaning to a word or document which has now been quantitatively described,
when in reality all that has been accomplished is the invention of some
type of frequency ratio.

In traditional library work confusion also appears to exist. Indeed
the very idea of classification implies to some that there is some
inherent content of a document which must be indexed. The already quoted
comment by R. A. Fairthorne can be cited as an expression of the
attitude of some classifiers.

"Future developments certainly should not upset any
decision about relevance; if an item is relevant to some
topic, it will always be relevant, though the relevance may
become unimportant and new relevancies may be added."

Let  us  suggest  that  the  intrinsic  meaning  or  concept  behind  a  word 
is  a  philosophical  problem  and  cannot  be  dealt  with  operationally. 

Those  aspects  of  a  document  which  do  not  influence  its  environment  (i.e. 
the  library  and  the  user)  are  of  no  practical  significance  because  they 
cannot  be  observed,  measured,  or  even  proved  to  exist. 

To  avoid  adding  further  to  this  misunderstanding  we  shall  avoid  the 
use  of  the  word  'relevance'  in  the  rest  of  this  paper.  The  frequency 
ratios  used  by  this  project  will  be  termed  'measures  of  relatedness'. 

It is hoped that this term is less loaded with connotations of intrinsic
meaning. 

1.35 Automatic Classification and Clumping Experiments

After automatic indexing was proposed for the assignment of
documents to categories, it was only natural that the automatic
determination of the categories themselves should be tried also. This was
done initially by borrowing two techniques from mathematical
psychology--factor analysis and latent class analysis. Factor analysis is
used to discover the underlying factors which account for the performance
of a group of people on a battery of tests. Latent class analysis is a
procedure used to divide a group of people into disjoint sub-groups on
the basis of their responses to a questionnaire.

Latent class analysis for information retrieval has not yet been
experimentally tested. Borko's work with factor analysis was based
on the occurrence of keywords in document abstracts. A correlation
matrix of keywords versus keywords was formed and was factor analyzed,
resulting in categories which had some resemblance to those manually
selected for the same corpus.

An even earlier attempt at automatic classification was tried by
Needham and Parker-Rhodes in England.[38,39,41] They called it clumping
and produced a heuristic procedure which selected clumps of documents
from a file. Their work has been extended in this country by Dale
and also by Bonner.

Since clumping is the most closely related endeavor to the
objectives of this project of any to date, a slightly more extended
description of the results will be given. A library collection is thought
of as a network with the nodes representing documents and values assigned
to the links (usually 0 or 1 only). This collection is partitioned into
two subsets, A and B. The sum of the links internal to A is denoted by
AA and the sum of the links internal to B is denoted by BB. The only
other links in the network are those which cross from set A to set B.
The sum of these links is designated AB.

A GR clump is defined as any set A which produces a local minimum
of the function

    f(A) = AB / (AA + BB)
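
The quantities AA, BB, and AB can be computed directly from a link
table. The sketch below is hedged on two assumptions that do not come
from the text: that the function takes the form AB / (AA + BB), and that
"local minimum" means no single document can be moved across the
boundary to lower the value. The names link_sums, cohesion, and
is_local_minimum and the six-document network are hypothetical.

```python
# Hypothetical network: two triangles of documents joined by one link.
# Keys are document pairs, values are link strengths (0 or 1 here).
LINKS = {
    (1, 2): 1, (1, 3): 1, (2, 3): 1,   # first triangle
    (4, 5): 1, (4, 6): 1, (5, 6): 1,   # second triangle
    (3, 4): 1,                         # single bridge between them
}
DOCS = {1, 2, 3, 4, 5, 6}


def link_sums(links, subset):
    """Return (AA, BB, AB) for the partition A = subset, B = the rest."""
    a = set(subset)
    aa = bb = ab = 0
    for (x, y), weight in links.items():
        members = (x in a) + (y in a)
        if members == 2:
            aa += weight      # both ends inside A
        elif members == 0:
            bb += weight      # both ends inside B
        else:
            ab += weight      # link crosses the boundary
    return aa, bb, ab


def cohesion(links, subset):
    """Assumed form of f(A): crossing links over internal links."""
    aa, bb, ab = link_sums(links, subset)
    return ab / (aa + bb) if (aa + bb) else float("inf")


def is_local_minimum(links, subset, universe):
    """True if moving any one document across the boundary does not lower f."""
    a = set(subset)
    base = cohesion(links, a)
    for doc in universe:
        neighbour = a ^ {doc}  # move doc to the other side
        if not neighbour or neighbour == set(universe):
            continue           # skip degenerate partitions
        if cohesion(links, neighbour) < base:
            return False
    return True


print(link_sums(LINKS, {1, 2, 3}))
print(is_local_minimum(LINKS, {1, 2, 3}, DOCS))
print(is_local_minimum(LINKS, {1, 2}, DOCS))
```

Under these assumptions the first triangle qualifies as a clump, while
the pair {1, 2} does not, since adding document 3 lowers the value.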
A more recent type of clump, the D clump, is defined as any set A
which produces a local minimum of the function G(A).[12]

GR clumps are fairly easy to locate. Some additional restrictions
must be placed on D clumps to make the definition useful since local
minima of G(A) occur for quite unrelated sets of documents. The latest
effort has been to find an initial set of items by some other method and
then use the D-clump method to complete the set.

Both  the  automatic  classification  and  the  clumping  experiments  are 
designed  so  that  all  of  the  classifying  and  indexing  would  be  completed 
before  the  requests  are  processed. 

1.36  Systems  Evaluation 

The most widely accepted method of evaluating the performance of
information retrieval systems is currently through the recall and
relevance ratios. The recall ratio is the percentage of relevant
items that are actually retrieved and the relevance ratio is the
percentage of retrieved items that are relevant.
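
The two ratios follow directly from their definitions, as the following
minimal sketch shows. The function names and the sample document
identifiers are hypothetical; the set of "relevant" items is simply
whatever standard of judgment is being applied.

```python
def recall_ratio(retrieved, relevant):
    """Percentage of relevant items that were actually retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when nothing is relevant")
    return 100.0 * len(retrieved & relevant) / len(relevant)


def relevance_ratio(retrieved, relevant):
    """Percentage of retrieved items that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        raise ValueError("relevance ratio is undefined when nothing is retrieved")
    return 100.0 * len(retrieved & relevant) / len(retrieved)


# Hypothetical search: 4 items retrieved, 5 judged relevant, 2 in common.
retrieved = {"a", "b", "c", "d"}
relevant = {"b", "c", "e", "f", "g"}
print(recall_ratio(retrieved, relevant))     # 2 of 5 relevant items found
print(relevance_ratio(retrieved, relevant))  # 2 of 4 retrieved items relevant
```

Both figures depend entirely on who decides what is relevant, which is
the question taken up next.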

In determining what is or is not relevant, recourse is usually
made to an indexer or a user. Recent studies have shown that these
people are able to agree among themselves as to how documents should be
classified in at most 50% of the cases. This "failure" of humans to
index consistently has led some to try to find better automatic
"non-judgmental" standards on which to validate relevance.

If the primary objective of a library is in serving a given user
population, then it is difficult to imagine that there could be any
criterion for relevance other than one based on those users. If, on the
other hand, the function of a library is to set up a universal
classification system, then the user should certainly be eliminated as the
standard on which system efficiency is evaluated.

The idea that the users of a system can "fail" in classifying a
document implies an intrinsic content in documents which one or more of
the users has not recognized. A more practical outlook in keeping with
the arguments of Sec. 1.34 is that these differences in indexing are
only the normal result of individual backgrounds and interests.


CHAPTER  II 

OBJECTIVE  OF  THIS  PROJECT 

2.1  Brief  Description  of  Project  Objective 

Let us assume for a moment that we wish to design an information
storage and retrieval system which is based on feedback from users. In
this system each request for information is to consist of a set of one
or more documents that the user has already found to be of interest and
a second (possibly empty) set of documents that he knows are not of
interest.

The purpose of each interaction of a user with the system is to
transform a request of this type into a partitioning of the total
collection into two disjoint subsets--one containing all documents that are of
interest to the user and the other containing those not of interest (the
rest of the stack). This process is to be accomplished jointly by the
user and the system.

The feedback which the system stores for use in answering future
requests is to consist of these file partitionings. A measure of the
relatedness between any two documents based on their usage and co-usage
patterns as found in the partitionings is to be utilized to facilitate
the request-to-answer transformation.

The document collection of such a system can be thought of as a
network where each node represents a document and each link is given a
value corresponding to the measure of relatedness between the two
linked documents.


The objective of this research endeavor is to devise, test, and
evaluate a procedure which will perform the transformation of request
to answer partition for this type of retrieval system.

In  the  above  discussion  we  suggested  for  purposes  of  illustration 
a  retrieval  system  based  on  file  partitionings  which  are  generated  by 
the  users  of  the  system.  Partitioning  information  of  this  sort  would 
not  be  available  for  documents  that  have  just  been  added  to  a  file. 
Indeed, such information is not readily available for any file of
documents at the present time.

There are, however, some types of partitionings which are available.
Take,  for  example,  the  citations  in  an  article.  The  author  of  an  article 
selects  for  citation  certain  documents  that  he  feels  are  pertinent  to 
the  article  he  has  written.  In  a  sense  he  is  a  special  type  of  user  of 
the  library  and  has  created  a  meaningful  partition  of  the  file.  Other 
types  of  partitionings  of  the  file  could  also  be  suggested. 

Usage  information  was  selected  for  discussion  here  because  it  is 
an  interesting  and  representative  example  of  the  larger  class  of  parti¬ 
tioning  information  for  which  we  propose  to  design  a  retrieval  system. 

In  the  remainder  of  this  chapter  and  in  the  next  chapter  we  will, 
therefore,  continue  to  talk  in  terms  of  the  partitionings  generated  by 
users. It should be understood, however, that the type of retrieval
system to be developed need not be restricted to this single type of
partitioning data.

In the next section we will present some arguments for and
against information retrieval based on usage information. We will then
discuss how usage information can best be represented and utilized.

2.2 Value of Usage Information

In the article already cited at the beginning of Chapter 1, V.
Bush suggested that an individual's personal information storage and
selection system could be based on direct connections between documents
instead of the usual connections between index terms and documents.
These direct connections were to be stored in the form of trails through
the literature. Then at any future time the individual himself or one
of his friends could retrace this trail from document to document
without the necessity of describing each document with a set of descriptors
or tracing it down through a classification tree.¹⁰

In 1956 R. M. Fano suggested that a similar approach might prove
useful to a general library. He proposed that "the concomitant use of
documents by experts as evidenced by library records, and other similar
joint events" might be a useful basis for document retrieval.¹⁹ His
proposal evoked a number of adverse comments, two of which will be quoted
here.


2.21  Objections 

A theoretical objection to basing retrieval on usage was raised by
Y. Bar-Hillel.

"A colleague of mine, a well known expert on
information theory, proposed recently, as a useful tool for
literature search, the compiling of pair-lists of documents
that are requested together by users of libraries. He even
suggested, if I understood him rightly, that the frequency
of such co-requests might conceivably serve as an indicator
of the degree of relatedness of the topics treated in these
documents.

"I believe that this proposal should be treated
with the greatest reserve. Although much less ambitious
than Taube's proposal of an association dictionary, it is in
many respects strikingly analogous to it and shares its
shortcomings. The fact that a co-requestedness chain of documents
can be easily followed up by a machine is not in itself a
sufficient reason for making the assumption that this relation
might be a useful approximation to the important relation of
dealing-with-related-topics between documents. And one can
think of many other easily establishable relationships between
documents that stand a better chance of being a useful
approximation, e.g. co-occurrence of their references in reference
lists printed at the end of many documents, co-quotation, and
so on."

The shortcoming of "Taube's proposal" referred to in this quote is
the familiar triangle argument.

"Knowing that 'a' and 'b' co-occur ... and that 'b' and 'c'
co-occur ... what do we know about the connection between the
'ideas' 'a' and 'c'? Clearly, nothing definite whatsoever..."

What Bar-Hillel says is true also of hierarchical classification
systems, where the adjacency of categories a and b and of categories b
and c proves nothing about the relationship of a and c. It is true of
any system consisting of a set of items and characteristics that cannot
be described by some type of metric space.

On  the  other  hand  the  fact  that  documents  a  and  c  are  not  related 
in  every  case  when  linked  through  a  third  document  b  is  more  of  a  hypo¬ 
thetical  objection  than  a  practical  one.  If,  in  fact,  items  with  the 
a-c  type  connection  are  found  to  be  related  on  the  average  much  more 
frequently  than  items  chosen  at  random,  then  the  usefulness  of  this  type 
of  connection  in  document  selection  should  not  be  overlooked. 

A second objection to Fano's suggestion was raised by C. N. Mooers.
It is a practical instead of a theoretical objection.

"To provide feedback for improving machine performance
Fano and others have suggested the use of statistics of the
way which people use the library collection. Though this
suggestion points in the right direction, I think this kind
of feedback would be a rather erratic source of information
on equivalence classes, because people might borrow books on
Jack London and Albert Einstein at the same time. Although
this difficulty can be overcome, there is a more severe problem.
Any computation of the number of people entering a library and
the books borrowed per day, compared with the size of the
collection shows, I think, that the rate of accumulation of
such feedback information would be too slow for the library
machine to catch up to and get ahead of an expanding technology."

Mooers'  objection  assumes  that  the  capability  of  accepting  feedback 
from  the  user  is  to  be  superimposed  on  a  conventional  library  structure 
and  that  it  will  have  little  net  effect  on  the  frequency  of  use  of  that 
library.  Let  us  accept  these  assumptions  for  the  moment  and  suggest 
some  reasons  why  usage  information  would  still  prove  profitable. 

First,  libraries  might  well  find  it  helpful  to  share  usage  patterns 
and  thereby  increase  the  total  information  available  to  any  one  library. 
Second,  the  well  used  documents  will  have  plenty  of  usage  statistics  and 
be well 'indexed', while unused books will have no statistics--a
seemingly equitable arrangement. Third, even the information on one usage
of a document may prove more valuable than the information supplied by
the  indexer  of  that  document.  Fourth,  usage  information  is  not  pur¬ 
ported  to  be  a  cure-all  which  will  replace  all  of  the  current  types  of 
selection  information.  It  is  felt  to  be  a  supplemental  source  of 
selection  clues  which  should  grow  in  importance  as  more  user  feedback  is 

Now let us return to the initial assumptions and note that the
number of people who enter a library is by no means an indication of
the amount of time spent in the study of printed material. It is merely
an indictment of current library practices. If, in fact, information
were made available to research workers right in their offices through
the type of computer time-sharing system described in Section 1.31, then
the amount of feedback available from users should radically change.

2.22 Supporting Arguments

Thus far in this section we have cited two early proposals that
document selection be based on user feedback. We have quoted both a
theoretical and a practical objection to such an approach and have
attempted to answer these objections. Let us now turn to some of the
positive arguments favoring user feedback which, to this author at least,
are compelling reasons why document retrieval should be based on
information from the user.

The first argument has already been alluded to in Section 1.26.
In that section the need for dynamic indexing was observed. It was
noted that it is impossible for an indexer to foresee all of the possible
applications of a paper at any given point in that paper's history, and
especially not just after it is written.

To account for the changing relationships and new applications of
papers in a collection, a library must be supplied with information.
Such information regarding the changing nature of the corpus must come
from the three participants in the library process--author, indexer,
and user.


To require indexers to periodically re-index the collection would
be financially impossible. Many libraries find it difficult to even
initially index each incoming document.

The textual information placed in the document by the authors
also offers little help. Take, for example, a research worker who
publishes a new discovery. The terminology which eventually evolves to
describe that discovery may be markedly different from the language of
the initial paper. And it would be a rather momentous task to develop
a thesaurus which could connect the groping language of the basic paper
with the codified terminology which eventually results.

Thus,  the  user  is  left  as  the  one  participant  in  the  library 
system  who  is  continually  interacting  with  the  collection  and  could 
introduce  dynamic  indexing  into  the  system. 

Let  us  note  at  this  point  that  citation  information  in  newly  added 
documents  represents  a  specialized  type  of  user  information  (the  author 
acting  as  a  user  of  the  old  file),  and  as  such  can  act  in  the  same  way 
as  usage  information  to  give  the  system  a  changing  indexing  structure. 
Some  other  advantages  of  this  source  of  indexing  information  were  noted 
in  Sec.  1.33. 

The  second  argument  in  support  of  the  utilization  of  user  feedback 
concerns  the  quality  of  the  indexing  which  results  thereby.  The  advant¬ 
age  of  having  the  indexing  done  by  people  actually  immersed  in  a  given 
research  area  can  hardly  be  overemphasized.  Hitherto  neglected  refine¬ 
ments  and  distinctions  can  be  made,  the  structure  of  the  field  as  the 
actual  worker  sees  it  can  be  established,  and  many  unintentional 
blunders  can  be  avoided. 


It should be noted that the quality of indexing by usage is a
controllable parameter. Take, for example, the users of articles in
the  Physical  Review.  This  group  of  people  represents  a  highly  know¬ 
ledgeable  and  motivated  segment  of  the  population  which  should  be  able 
to  form  valid  links  between  documents.  If,  however,  the  quality  of  the 
resulting  indexing  is  still  insufficient,  the  system  could  be  designed 
to  accept  feedback  from  only  a  segment  of  the  population--say  the  faculty 
but  not  the  students.  This  could  even  be  made  a  parameter  specifiable 
by  the  user  so  that  he  could  use  the  feedback  from  that  segment  of  the 
population  which  most  closely  fitted  his  own  background. 

A  third  reason  for  indexing  by  user  feedback  is  that  it  may  be 
possible  to  do  it  as  a  by-product  of  normal  library  use  and  thus  avoid, 
to  some  extent,  the  high  cost  of  indexing  which  currently  burdens  a 
library. 

2.23  Collecting  Usage  Information 

Let  us  now  discuss  the  problem  of  how  the  intellectual  decisions 
needed  from  the  user  can  best  be  obtained.  The  sets  of  citations  found 
in  articles  form  one  readily  available  source  of  sets  of  documents  that 
have been judged mutually pertinent. The data used by the experimental
portion of this project was taken from this source. (See Sec. 6.22.)

Let  us  consider  for  a  moment  whether  a  retrieval  system  could  be 
designed  which  was  based  on  usage  data  of  the  type  described  in  Sec.  2.1. 
One  major  difficulty  would  be  to  devise  some  way  of  encouraging  the 
user  to  supply  the  system  with  the  data  needed.  Some  possible  ways 
this might be accomplished are the following:

1. The user finds that the system automatically disseminates to
   him new articles of interest if he has provided profiles of
   his interests in the form of sets of papers of known interest.

2. The user finds that in interacting with the retrieval program
   he converges on papers of interest more rapidly if he tells
   the system whether each paper presented is of interest or not.

3. The user contributes sets of related papers to the system
   because he wishes to improve its usefulness to himself and
   others.

4. Certain users are provided monetary remuneration for supplying
   the system with sets of related documents.

2.3 The Purpose of Measures of Relatedness

The  next  question  that  arises  after  one  has  accepted  the  idea  that 
information  selection  might  appropriately  be  based  on  some  type  of  usage 
data  concerns  the  form  that  this  data  should  be  expressed  in.  One 
might  propose  that  each  usage  set  be  treated  the  same  way  as  a  subject 
heading  or  descriptor  set  with  its  label  being  the  name  of  the  user 
that generated the set. Under this scheme one might retrieve all of the
papers of interest to a given user or all of the papers which have been
found of mutual interest with a selected paper. Indeed the ability to
answer these types of questions is a valid capability to equip a
retrieval system with.

However, there are some significant differences between the sets of
papers generated by users and the sets of papers generated by some type
of indexing scheme. First, there is the fact that any given paper occurs
in, at most, only a handful of indexing categories, while it might
possibly occur in a very large number of user sets. Second, there can
be any number of user sets centering around a given area of research,
but this area would be normally covered by only one subject category.
Third, usage sets would be continually added to the system, but new
categories would be added infrequently.

All  this  adds  up  to  the  fact  that  users  who  attempt  to  extract 
information  from  usage  files  with  normal  matching  techniques  will 
probably  be  overwhelmed  with  the  non-uniform,  massive,  fluctuating 
nature  of  this  type  of  data. 

Some  type  of  statistical  measure  is  needed  which  will  combine  and 
summarize  the  results  of  many  user  interactions.  The  specific  charac¬ 
teristics  which  this  measure  should  have  are  discussed  in  Chapter  III. 


PART  TWO:  THEORETICAL  DEVELOPMENT 


The  three  chapters  of  this  part  describe  the  theoretical 
model  on  which  the  research  project  is  based.  There  are  three 
closely  related  components  of  the  model. 

Chapter  III:  Measure  of  Relatedness 
Chapter  IV:  Cluster  Definition 
Chapter  V:  Search  Procedure 

The  experimental  system  which  was  devised  to  test  the 
applicability  of  the  model  to  a  real  world  situation  will  be 
described in Part Three. It is hoped that this organization
will help in keeping the abstract ideas of the model separate
from the particular physical implementation which was developed
to test them. It may be somewhat misleading, however. In
actuality  the  model  was  not  completely  developed  before  the 
implementation  began.  It  was  continually  revised  and  improved 
as  various  versions  of  experimental  systems  were  programmed, 
tested  and  then  discarded.  What  is  described  in  this  and  the 
next  part  is  the  current  model  and  test  program. 


CHAPTER III

MEASURE OF RELATEDNESS

The  first  step  in  establishing  the  conceptual  basis  of  the  research 
project  is  the  selection  of  a  measure  of  the  relatedness  between  docu¬ 
ments.  To  this  end  a  sample  space  will  be  defined  and  a  probability 
distribution assigned to it. Then a measure based on these
probabilities will be selected and some of its characteristics noted. Finally
the  document  network  generated  by  the  measure  will  be  described. 

3.1  Sample  Space 

In  order  to  motivate  the  choice  of  our  mathematical  model,  we 
regard  each  interaction  of  a  user  with  a  library  as  a  partitioning  of 
the  stack  into  two  disjoint  subsets  of  documents:  one  containing  all 
the documents of interest to the user and the other containing the rest
of  the  documents.  Each  interaction  is  assumed  to  have  a  single  purpose 
in  the  sense  that  all  documents  of  interest  are  of  interest  for  the 
same  purpose. 

There are theoretically $2^n$ such partitionings possible for a stack
of n documents. Now let us think of a discrete collection of $2^n$ points
(a sample space), each representing one of the possible partitionings.
These points can be identified by n-bit binary numbers, $x_1 \ldots x_n$,
where $x_i$ is 1 if the i-th document is in the subset of interest and 0
if it is in the subset of no interest for the partition in question. (A
superscript will be used to denote the value of a variable: $x_i^1$
means $x_i = 1$ and $x_i^0$ means $x_i = 0$.)

For a given user population and document collection a probability
distribution $p(x_1 \ldots x_n)$ can be assigned to the sample space. Each
$p(x_1 \ldots x_n)$ may be regarded as the probability that a user chosen at
random from the population will partition the document collection with
the partition $x_1 \ldots x_n$.

Compound events can be defined in terms of the simple events
represented by the sample points. For example, $p(x_1^1)$, the probability that
document 1 will be of interest to some user, can be obtained by summing
the probabilities of all points for which $x_1 = 1$.

\[ p(x_1^1) = \sum_{x_2 \ldots x_n} p(x_1^1 x_2 \ldots x_n) \]

Similarly $p(x_1^1 x_2^1)$, the probability that documents 1 and 2 will be
found to be of interest jointly, can be obtained by summing up the
probabilities of all points for which $x_1 = 1$ and $x_2 = 1$.

\[ p(x_1^1 x_2^1) = \sum_{x_3 \ldots x_n} p(x_1^1 x_2^1 x_3 \ldots x_n) \]

In  the  sections  that  follow  we  will  want  to  talk  not  only  about 
the  abstract  theoretical  values  of  these  probabilities,  but  also  about 
their  estimated  values  as  obtained  from  experimental  data.  Suppose  that 
there  is  information  available  on  a  large  number  of  partitionings  of  a 
library.  Let  us  make  the  following  definitions. 

$N$: Total number of partitionings of the library that are
available.

$N_i$: Number of partitionings in which document i occurs in the
subset of interest.

$N_{ij}$: Number of partitionings in which both documents i and j
occur in the subset of interest.

Based on these N's, estimates of the probabilities can be made as
follows:

\[ p(x_i^1) \approx \frac{N_i}{N}, \qquad p(x_i^1 x_j^1) \approx \frac{N_{ij}}{N}, \]

etc.
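
The counting step can be sketched as follows. This is a minimal illustration rather than the program used in the experimental system: the function names are hypothetical, each partitioning is represented only by its subset of interest, and document identifiers are assumed to be sortable.

```python
from collections import Counter
from itertools import combinations


def count_partitionings(interest_sets):
    """Return (N, N_i counts, N_ij counts) from a list of subsets of interest."""
    n_total = len(interest_sets)
    n_single = Counter()  # N_i: partitionings containing document i
    n_pair = Counter()    # N_ij: partitionings containing both i and j
    for docs in interest_sets:
        docs = sorted(set(docs))  # sorted so each pair has one canonical key
        n_single.update(docs)
        n_pair.update(combinations(docs, 2))
    return n_total, n_single, n_pair


def estimate_probabilities(interest_sets):
    """Estimate single and pairwise probabilities as ratios of counts to N."""
    n_total, n_single, n_pair = count_partitionings(interest_sets)
    p_single = {doc: count / n_total for doc, count in n_single.items()}
    p_pair = {pair: count / n_total for pair, count in n_pair.items()}
    return p_single, p_pair


# Four hypothetical partitionings of a three-document stack.
sets_of_interest = [{"a", "b"}, {"a", "b", "c"}, {"c"}, {"a"}]
p_single, p_pair = estimate_probabilities(sets_of_interest)
print(p_single["a"], p_pair[("a", "b")])  # 0.75 0.5
```

Pairs that never co-occur are simply absent from the pair table, which corresponds to an estimated joint probability of zero.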

The partitioning data employed in these estimates may result from
experimental evidence other than actual user interactions with the stack
of documents in question. For instance, one might partition the stack
on the basis of whether or not the documents cite a given document, or
on the basis of whether or not they contain a particular word in their
titles. As a matter of fact, the experimental system described in
Chapter VI uses partitionings based on whether or not the documents cite
a given document because these were readily available while actual usage
data were not.

This use of another type of partitioning data (other than usage
data) by the experimental system is considered acceptable here since
the purpose of the experimental portion of the project is to permit an
investigation of general properties of the theoretical model that should
be largely independent of the precise values of the probability
estimates.

3.2 Criteria for Selecting a Measure of Relatedness

We have already noted in Sec. 1.34 that a number of measures of
'relevance' have been suggested for use in information retrieval. Some
of the more widely known of these measures are tabulated in Appendix A.
The differences between them are partially due to the fact that they
were designed for different purposes and partially due to the varied
backgrounds of the people who proposed them. Some of them have a
theoretical basis in probability, statistics, or information theory; others
are of an ad hoc nature.

In Sec. 2.3 we discussed why a measure of relatedness was needed
for this project. The purpose of such a measure is not to rate the
individual or joint merit of the documents in the stack, but rather to
represent their relationship in terms of frequency of use and co-use.
To this end it was decided that the measure selected should have the
seven characteristics listed below.

Not all of the measures of Appendix A are expressible in terms of
the theoretical probabilities of the last section. Therefore, for
purposes of comparison we shall express these seven criteria in terms of
the frequency counts on which the estimated probabilities are based.
The N's are as defined in the last section, C is the measure of
relatedness between documents i and j, and $R \propto S \mid T$ means that R
monotonically increases with S as T is held constant.

1. Co-occurrence Factor  $C \propto N_{ij} \mid N, N_i, N_j$

The measure should monotonically increase with the number of
co-occurrences in the subset of interest of the documents in question if
all other factors are held constant. Consider, for example, a pair of
documents (i,j) and another pair (r,s). If the N's are the same for
both pairs except that $N_{ij} > N_{rs}$, then the relatedness between i and j
should be greater than the relatedness between r and s.

2. Other Usage Penalty Factor  $C \propto 1/N_i \mid N, N_j, N_{ij}$

The measure should monotonically decrease as the number of
occurrences of one of the documents increases, all other factors being
held constant. That is, if document i is used a larger number of times
but not in conjunction with document j, then the relatedness between i
and j should decrease.


3. Co-occurrence Ratio Factor  $C \propto N_{ij}/N_i \mid N_j$

If the ratio or fraction of the number of co-occurrences of
document i with document j to the total occurrences of document i
increases, the measure should increase also. Note that this criterion is
not a consequence of 1 and 2.

4. Function of Probability Estimates Only  $C(N_i/N,\ N_j/N,\ N_{ij}/N)$

The measure should depend only on the ratios of frequency
counts which are used to estimate the probabilities. As long as these
ratios remain constant the measure should not change.

5. Statistical Independence

The one bench mark that is available for measures is the
statistical independence of the events in question. It would seem
logical that if the occurrences of two documents are statistically
independent, their measure of relatedness should have the value 0.

6. Theoretical Basis

A measure that has a solid theoretical basis is to be
preferred over one which has been developed by trial and error.

7. Ease of Use

The best measure is a simple one that is easy to calculate
and manipulate.
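
The first five criteria can be checked numerically for any candidate measure written as a function of the frequency counts. The sketch below does so for one candidate, the logarithm of N·N_ij / (N_i·N_j), which is the estimated form of the information-theoretic measure adopted in the following sections. The function name and the counts are hypothetical, and a few spot checks of this kind illustrate the criteria rather than prove them.

```python
import math


def measure(n, n_i, n_j, n_ij):
    """Candidate measure in bits: log2 of N * N_ij / (N_i * N_j)."""
    return math.log2(n * n_ij / (n_i * n_j))


base = measure(1000, 50, 40, 10)

# Criterion 1: more co-occurrences, other counts fixed, gives a larger value.
assert measure(1000, 50, 40, 12) > base

# Criterion 2: more occurrences of document i alone gives a smaller value.
assert measure(1000, 60, 40, 10) < base

# Criterion 3: a larger ratio N_ij / N_i (0.25 against 0.20), N_j fixed,
# gives a larger value.
assert measure(1000, 100, 40, 25) > base

# Criterion 4: doubling every count leaves all ratios, and the value, unchanged.
assert math.isclose(measure(2000, 100, 80, 20), base)

# Criterion 5: independence predicts N_ij = N_i * N_j / N = 2, giving 0.
assert math.isclose(measure(1000, 50, 40, 2), 0.0, abs_tol=1e-12)

print(f"base value = {base:.3f} bits")
```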


3.3 Selection of a Measure

Let us now evaluate the measures of Appendix A in terms of the
criteria of the last section. Measures (1) and (2) have no theoretical
basis (Criterion 6) and are not 0 for statistically independent events
(Criterion 5). The Chi Square Formula (5) is not expressible in terms
of the probability estimates (Criterion 4). The value of the Cosine
Formula (6) for statistically independent events is
$\sqrt{p(x_i^1 x_j^1)}$, which is neither 0 nor even constant. The Average
Correlation Coefficient (7) does not satisfy Criteria 1, 2, or 3.

This leaves Measures (3), (4), and (8), which meet (at least partially) all
of the criteria listed. Measure (8) was selected for this research
project because its foundation in information theory has led to some very
interesting and useful results.

The use of Measure (8) in document retrieval was first proposed by
R. M. Fano. In its more general form it expresses the degree to which
a set of events $x_1^1 \ldots x_n^1$ are correlated in terms of their individual
and joint probabilities.

\[ C(x_1^1 \ldots x_n^1) = \log \frac{p(x_1^1 \ldots x_n^1)}{p(x_1^1) \cdots p(x_n^1)} \tag{1} \]

The base of the logarithm function used in the formula and
throughout the remainder of this paper will be assumed to be 2. This will mean
that the unit of correlation will be the "bit".

If only 2 events, i and j, are considered, then the coefficient is
equal to the mutual information, $I(x_i^1; x_j^1)$, between the 2 events as
defined in information theory.²⁰

\[ C(x_i^1 x_j^1) = I(x_i^1; x_j^1) = \log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)} \tag{2} \]

Let us relate the probabilities of formulae (1) and (2) to the
probabilities of document usage defined over the sample space of the
preceding section. The event $x_i^1$ is now the occurrence of document i in
a user's set of interest. The correlation $C(x_i^1 x_j^1)$ is the degree to
which the two documents, i and j, are taken to be mutually pertinent.

The approximation to C in terms of the estimated probabilities will
be denoted by the symbol $C'$.

\[ C(x_i^1 x_j^1) = \log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)} \approx \log \frac{N\,N_{ij}}{N_i\,N_j} = C'(x_i^1 x_j^1) \]
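
The estimate for a document pair is a one-line computation once the counts are available. In the sketch below the function name is hypothetical, and returning negative infinity for a pair that never co-occurs is a convention chosen for this illustration, not something the text prescribes.

```python
import math


def relatedness(n_total, n_i, n_j, n_ij):
    """Estimated relatedness in bits: log2(N * N_ij / (N_i * N_j))."""
    if n_ij == 0:
        # The logarithm of zero is undefined; treat the pair as
        # maximally unrelated (an assumption made for this sketch).
        return float("-inf")
    return math.log2(n_total * n_ij / (n_i * n_j))


# Hypothetical counts: 100 partitionings, document i in 10, document j in 20.
print(relatedness(100, 10, 20, 8))  # 2.0 (co-occur four times as often as chance)
print(relatedness(100, 10, 20, 2))  # 0.0 (exactly what independence predicts)
```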


3.4 Practical Considerations

In order to calculate the measure of relatedness C for any
arbitrary set of documents selected from a collection of n documents, one
would have to estimate and perhaps store at least $2^n - 1$ probabilities.
This is, of course, out of the question for any reasonably-sized
document file. If C is to be used, some approximating simplification must
be made.

Let us now note that this correlation coefficient C can be expanded
in terms of mutual information terms as follows²⁰:

\[ C(x_1^1 \ldots x_n^1) = \sum_{i<j} I(x_i^1; x_j^1) - \sum_{i<j<k} I(x_i^1; x_j^1; x_k^1) + \cdots \]

where

\[ I(x_1; x_2) = \log \frac{p(x_1 x_2)}{p(x_1)\,p(x_2)} \]

\[ I(x_1; x_2; x_3) = \log \frac{p(x_1 x_2)\,p(x_1 x_3)\,p(x_2 x_3)}{p(x_1)\,p(x_2)\,p(x_3)\,p(x_1 x_2 x_3)} \]

etc.
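
For three documents the series terminates after the triple term, so the expansion can be checked numerically. The sketch below draws an arbitrary probability distribution over the eight partitionings of a three-document stack and compares the two sides. It assumes the triple term enters with a minus sign, which is what the definition of I(x1;x2;x3) given above requires for the identity to hold; the variable and function names are hypothetical.

```python
import itertools
import math
import random

random.seed(0)

# Arbitrary distribution over the 2^3 partitionings of three documents.
points = list(itertools.product((0, 1), repeat=3))
weights = [random.random() for _ in points]
total = sum(weights)
p = {x: w / total for x, w in zip(points, weights)}


def joint(indices):
    """Probability that every document in `indices` is in the subset of interest."""
    return sum(prob for x, prob in p.items() if all(x[i] == 1 for i in indices))


def pair_info(i, j):
    """Mutual information between the events for documents i and j."""
    return math.log2(joint((i, j)) / (joint((i,)) * joint((j,))))


triple_info = math.log2(
    joint((0, 1)) * joint((0, 2)) * joint((1, 2))
    / (joint((0,)) * joint((1,)) * joint((2,)) * joint((0, 1, 2)))
)

c_exact = math.log2(joint((0, 1, 2)) / (joint((0,)) * joint((1,)) * joint((2,))))
c_series = pair_info(0, 1) + pair_info(0, 2) + pair_info(1, 2) - triple_info

print(math.isclose(c_exact, c_series, rel_tol=1e-9, abs_tol=1e-9))
```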

It  has  been  proposed  that  C  be  approximated  by  the  first  summation 
in  this  series,  and  that  the  other  summations  be  dropped  as  higher- 
order  effects.  There  are  some  theoretical  reasons  which  would  lead  one 
to believe that this would result in a good approximation to C. However, we shall rest our case here on practical necessity and not go into
the details of these theoretical arguments.


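$$C(x_1^1 \dots x_n^1) \approx \frac{1}{2}\sum_{\substack{i,j=1 \\ (i \ne j)}}^{n} I(x_i^1;x_j^1) = \frac{1}{2}\sum_{\substack{i,j=1 \\ (i \ne j)}}^{n} \log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)}$$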


For this approximation one need only estimate and store n univariate and $\binom{n}{2}$ bivariate probabilities in order to obtain the correlation between events and subsets of events.
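The pairwise approximation can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the project's code: the names `pair_information` and `approx_set_correlation` are hypothetical, the univariate probabilities are assumed to be held in a dictionary keyed by document number, and the bivariate probabilities in a dictionary keyed by the pair (i, j) with i < j. Summing once over unordered pairs is the same as taking half the sum over all ordered pairs with i ≠ j.

```python
import math
from itertools import combinations


def pair_information(p_i: float, p_j: float, p_ij: float) -> float:
    """Mutual information I(x_i; x_j) in bits; -inf if the pair never co-occurs."""
    if p_ij == 0.0:
        return float("-inf")
    return math.log2(p_ij / (p_i * p_j))


def approx_set_correlation(docs, p_single, p_pair) -> float:
    """Approximate C for a set of documents by the first (pairwise) summation.

    p_single[i] is p(x_i); p_pair[(i, j)] with i < j is p(x_i x_j).
    Only n univariate and n(n-1)/2 bivariate values are ever consulted.
    """
    total = 0.0
    for i, j in combinations(sorted(docs), 2):
        total += pair_information(p_single[i], p_single[j], p_pair[(i, j)])
    return total
```

For a file of n documents the two dictionaries hold n and n(n-1)/2 entries respectively, in contrast to the $2^n$ values the exact coefficient would need.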

Through the same approach one can obtain an approximation to the correlation between any two subsets of events:

$$C[(x_1^1 \dots x_r^1)(y_1^1 \dots y_s^1)] \approx \sum_{i,j} I(x_i^1;y_j^1)$$


If these subsets overlap then one or more of the terms in the series becomes the self correlation of the event.

$$C(x_i^1 x_i^1) = \log \frac{p(x_i^1)}{p(x_i^1)\,p(x_i^1)} = \log \frac{1}{p(x_i^1)}$$


3.5  Characteristics  of  the  Measure  for  Document  Pairs 

The measure of relatedness is 0 for two statistically independent events:

$$p(x_i^1 x_j^1) = p(x_i^1)\,p(x_j^1)$$

For  events  occurring  together  less  often  than  if  they  were  statistically 
independent,  C  is  negative  and  for  events  occurring  together  more  often 
C  is  positive. 

Theoretically the range of C is from $-\infty$ to $+\infty$. However, there is a statement that can be made about the upper bound. Since $p(x_i^1 x_j^1)$ cannot be larger than $p(x_i^1)$ or $p(x_j^1)$ the following inequalities hold:

$$C(x_i^1 x_j^1) = \log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)} \le \log \frac{1}{p(x_i^1)}, \qquad C(x_i^1 x_j^1) \le \log \frac{1}{p(x_j^1)}$$


The quantity $\log[1/p(x_i^1)]$ is termed the self information of $x_i^1$ in information theory^20. Thus, the correlation between two events is always less than or equal to the self information of either event. Let us indicate this range on the simple graph of Fig. 3.1.


[Figure: number line for C extending from $-\infty$ up to $\mathrm{Max}[\log(1/p(x_i^1))]$.]

Fig. 3.1. Range of measure of relatedness.


Some additional comments about the range of the measure can be made if we consider $\hat{C}$, the approximation to C based on the estimated probabilities. The maximum positive value of $\hat{C}$ is (log N) and occurs when $N_i$, $N_j$, and $N_{ij}$ all equal 1. Its minimum value other than $-\infty$ is (2 - log N) and occurs when $N_{ij}$ is 1 and $N_i$ and $N_j$ are N/2. This range is shown in Fig. 3.2.
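Both limits can be checked by substitution. The short derivation below assumes the count form of the estimate, $\hat{C} = \log_2 (N_{ij} N / N_i N_j)$, in which the estimated probabilities are the relative frequencies $N_i/N$ and $N_{ij}/N$:

```latex
\hat{C}_{\max} = \log_2 \frac{(1)(N)}{(1)(1)} = \log_2 N ,
\qquad
\hat{C}_{\min} = \log_2 \frac{(1)(N)}{(N/2)(N/2)} = \log_2 \frac{4}{N} = 2 - \log_2 N .
```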


[Figure: number line for $\hat{C}$ showing the value $-\infty$ and the range from (2 - log N) to (log N).]

Fig. 3.2. Range of approximation to measure of relatedness.


For the test data utilized in the experimental portion of this project (see Sec. 6.1) it was found that the $\hat{C}$'s were either $-\infty$ or had some positive value (see Fig. 3.3). The lower limit of (2 - log N) in Fig. 3.2 is changed in Fig. 3.3 since all of the $N_i$'s of the test data are much less than N/2. The new minimum of $\hat{C}$ occurs when $N_{ij}=1$ and $N_i$ and $N_j$ are maximum (called $(N_i)_{\max}$).


[Figure: number line for $\hat{C}$ showing the value $-\infty$ and a positive range whose lower end is determined by $(N_i)_{\max}$.]

Fig. 3.3. Range of measure of relatedness for test data.


The range for the test data is due not so much to the fact that the occurrences of the documents in the test file are never statistically independent as to the fact that such statistical independence can only be detected with a very large data base. Consider documents i and j with $p(x_i^1)$, $p(x_j^1) = 0.0001$. If $x_i^1$ and $x_j^1$ are statistically independent, then $p(x_i^1 x_j^1) = 10^{-8}$. In order for any of the probability estimates to be this small we would need at least $10^8$ partitionings. Many, many more partitionings than this would be needed if one wanted to have accurate estimates of the occurrences of such rare events. With fewer partitionings these events either never occur, resulting in $\hat{p}(x_i^1 x_j^1) = 0$, or do occur with the estimate for $p(x_i^1 x_j^1)$ being larger than it should be. This is the phenomenon observed for the test data. Even if there were correlations that were 0 or slightly negative they would be pushed to $-\infty$ or to some positive value because of the limited number of partitionings available.
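A short calculation makes the effect concrete. Suppose, purely for illustration, that the file holds N partitionings and that the single-document estimates happen to be exact, so that $N_i = N_j = 10^{-4}N$. The smallest nonzero co-occurrence count is $N_{ij} = 1$, which gives

```latex
\hat{C} = \log_2 \frac{N_{ij}\,N}{N_i\,N_j}
        = \log_2 \frac{N}{(10^{-4}N)^2}
        = \log_2 \frac{10^{8}}{N} .
```

Under these assumptions any observed co-occurrence yields a positive $\hat{C}$ whenever $N < 10^8$; the only other possible outcome is $N_{ij} = 0$, that is, $\hat{C} = -\infty$.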

It is conjectured that this will be the situation in most practical cases for some time to come. In a very large document collection ($10^6$-$10^7$ items) the probability of occurrence of any one document is probably small, say 10^[exponent illegible] or 10^[exponent illegible]. This would require a file of 10^[exponent illegible] to 10^[exponent illegible] partitionings to measure statistical independence, which would take considerable time and effort to collect. In a small document collection the probability of occurrence of any one document could be larger but the number of partitionings available would undoubtedly be less also.

It should be pointed out that this measure will assume some value for every pair of documents in the stack (except perhaps documents that have never been used). Even two documents that have never co-occurred ($N_{ij}=0$) are related by the value $-\infty$.

A few comments should be made about the value $-\infty$. It is not a realistic value for the correlation between most documents because it implies that there is absolutely no chance of two documents co-occurring. As has already been pointed out this arises because the probabilities may end up exactly zero. A much more practical and reasonable approach to the problem would be to make all correlations between document pairs for which $N_{ij}=0$ equal to some finite negative value instead of $-\infty$. More will be said on the choice of this negative value (K) later (Sec. 4.5).


Fig. 3.4. Revised range of measure for test data.


Another  feature  of  the  selected  measure  is  that  it  is  non-directional. 
That  is,  the  value  of  the  measure  from  document  i  to  j  is  the  same  as 
from  j  to  i. 

3.6  Document  Networks 

It  has  been  suggested  that  measures  of  the  relatedness  between  docu¬ 
ments  should  be  metrics^.  This  would  require  that  a  measure  C  exhibit 
the  following  properties: 

(1) $C(x,x) = 0$

(2) $C(x,y) > 0$ (if $x \ne y$)

(3) $C(x,y) = C(y,x)$

(4) $C(x,y) + C(y,z) \ge C(x,z)$

The measure under consideration does meet property (3). It might conceivably be made to fit properties (1) and (2) through some type of normalization or restriction. There appears to be no way to make it have property (4), the triangle inequality. Indeed, it would be rather disturbing to this author if it did have property (4).

Bar-Hillel has pointed out in the comment cited in Sec. 2.21 that many of the important aspects of a document collection (except physical location) cannot be made to satisfy the triangle inequality and cannot, therefore, be represented by metrics. His conclusion was that measures derived from these features (joint usage, common citation, etc.) are useless. Our conclusion is that such measures should not be required to be metrics.

The  idea  that  a  metric  space  is  the  appropriate  model  for  a  docu¬ 
ment  collection  is  rejected  here.  If  one  desires  a  model  to  aid  in  his 
mental  picture  of  a  document  collection,  a  simple  network  is  suggested. 
Each  document  can  be  considered  a  node  and  the  link  between  two  nodes 
can  be  assigned  the  value  of  the  measure  of  relatedness  between  the 
corresponding  documents.  It  has  already  been  pointed  out  that  the 
measure  of  relatedness  chosen  links  every  node  (document)  to  every  other 
node.  It  might,  therefore,  be  easier  to  visualize  the  sub-network  con¬ 
sisting  of  only  positive  links.  This  is  the  visual  picture  found  most 
helpful  to  the  author. 

Thus far we have considered the problem of generating a document network from a set of probabilities. Let us now consider the reverse process. If one draws a document network and arbitrarily chooses the values to be assigned to the links, can a set of probabilities be found which could have generated the network? This question is of interest because if there is only a certain class of networks that are realizable from sets of probabilities, then we need focus our attention only on that class.

Theorem.  For  every  document  network  (with  the  restriction 
that  the  values  of  the  positive  links  be  finite)  there  is  at  least 
one  set  of  probabilities  which  could  have  generated  it. 

Proof. The first step in proving this theorem will be to select a set of values for the elementary probabilities, $p(x_1 \dots x_n)$. It will then be shown that the set selected yields the correct values for the links of the network in question and forms a valid set of probabilities (i.e. each value is in the range 0 to 1 and their sum is 1).

Before  proceeding  let  us  define  the  following  symbols. 

n: number of documents in the network (n>2).

$C(x_i^1 x_j^1)$: value of the network link between documents $x_i$ and $x_j$.

$C_{\max}$: maximum value of $C(x_i^1 x_j^1)$.

k: the lesser of the two quantities: $(1/n)$ and $(1/n)2^{-C_{\max}}$.

It will also be convenient to introduce at this point one additional notation convention. Let us allow the values of the variables in the $p(x_1 \dots x_n)$'s which differ from 0 to be specified by a statement following a colon as well as by superscripting. For example:

$$p(x_1 \dots x_n : x_i = 1) = p(x_1^0 \dots x_{i-1}^0\, x_i^1\, x_{i+1}^0 \dots x_n^0)$$

We are now ready to state the values for the elementary probabilities, $p(x_1 \dots x_n)$. Four possible classes will be considered.

(1) All $p(x_1 \dots x_n)$ for which three or more x's are 1:

$$p(x_1 \dots x_n : \text{at least 3 } x\text{'s} = 1) = 0$$

(2) All $p(x_1 \dots x_n)$ for which two x's are 1:

$$p(x_1 \dots x_n : x_i, x_j = 1) = k^2\, 2^{C(x_i^1 x_j^1)} \quad \text{for all } i,j\ (i \ne j).$$

(3) All $p(x_1 \dots x_n)$ for which one x is 1:

$$p(x_1 \dots x_n : x_i = 1) = k - k^2 \sum_{\substack{j=1 \\ (j \ne i)}}^{n} 2^{C(x_i^1 x_j^1)} \quad \text{for all } i.$$

(4) The $p(x_1 \dots x_n)$ for which no x is 1:

$$p(x_1^0 \dots x_n^0) = 1 - nk + (k^2/2) \sum_{\substack{i,j=1 \\ (i \ne j)}}^{n} 2^{C(x_i^1 x_j^1)}$$

The  motivation  behind  the  selection  of  these  values  will  become 
clearer  as  the  discussion  proceeds.  It  may  be  helpful,  however,  to  note 
three  of  the  underlying  ideas  at  this  point. 

(1) Each $p(x_i^1)$ is to have the same value.

$$p(x_i^1) = k$$

(2) The value of the $p(x_i^1)$'s is to be chosen so that the $p(x_i^1 x_j^1)$'s can be adjusted to give the desired $C(x_i^1 x_j^1)$'s.

$$p(x_i^1 x_j^1) = k^2\, 2^{C(x_i^1 x_j^1)}$$

(3) The only elementary events that are allowed to occur are those with zero, one or two documents in the subset of interest.
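Let us prove that the elementary probabilities as selected above generate the correct values for the links of the document network. Preliminary to doing this we will determine the values of the $p(x_i^1)$'s and $p(x_i^1 x_j^1)$'s.

$$\begin{aligned}
p(x_i^1) &= \sum_{\text{all } p\text{'s for which } x_i = 1} p(x_1 \dots x_n) \\
&= p(x_1 \dots x_n : x_i = 1) + \sum_{\substack{j=1 \\ (j \ne i)}}^{n} p(x_1 \dots x_n : x_i, x_j = 1) \\
&= k - k^2 \sum_{\substack{j=1 \\ (j \ne i)}}^{n} 2^{C(x_i^1 x_j^1)} + k^2 \sum_{\substack{j=1 \\ (j \ne i)}}^{n} 2^{C(x_i^1 x_j^1)} \\
&= k \quad \text{for all } i.
\end{aligned}$$

$$\begin{aligned}
p(x_i^1 x_j^1) &= \sum_{\text{all } p\text{'s for which } x_i, x_j = 1} p(x_1 \dots x_n) \\
&= p(x_1 \dots x_n : x_i, x_j = 1) \\
&= k^2\, 2^{C(x_i^1 x_j^1)} \quad \text{for all } i,j\ (i \ne j).
\end{aligned}$$

$$\log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)} = \log \frac{k^2\, 2^{C(x_i^1 x_j^1)}}{(k)(k)} = C(x_i^1 x_j^1) \quad \text{for all } i,j\ (i \ne j).$$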
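The construction can be tried numerically. The following is a minimal sketch under stated assumptions, not part of the original proof: the function names are hypothetical, the network is assumed to be given as a symmetric list of lists of link values in bits (the diagonal is ignored), and links may be `-inf` but positive links must be finite. It builds the class (2), (3) and (4) probabilities and recovers a link from them.

```python
import math
from itertools import combinations


def probabilities_from_network(links):
    """Return (p_none, p_single, p_pair) for a symmetric link matrix.

    p_pair[(i, j)], i < j : probability that exactly documents i and j occur.
    p_single[i]           : probability that exactly document i occurs.
    p_none                : probability that no document occurs.
    Events with three or more documents are given probability 0.
    """
    n = len(links)
    c_max = max(links[i][j] for i, j in combinations(range(n), 2))
    k = min(1.0 / n, (1.0 / n) * 2.0 ** (-c_max))
    p_pair = {(i, j): k * k * 2.0 ** links[i][j]
              for i, j in combinations(range(n), 2)}
    p_single = []
    for i in range(n):
        s = sum(2.0 ** links[i][j] for j in range(n) if j != i)
        p_single.append(k - k * k * s)
    # (k^2 / 2) * sum over ordered pairs equals the sum over unordered pairs.
    p_none = 1.0 - n * k + sum(p_pair.values())
    return p_none, p_single, p_pair


def recovered_link(i, j, p_single, p_pair):
    """Recompute C(x_i x_j) from the constructed probabilities."""
    def marginal(d):
        return p_single[d] + sum(p for pair, p in p_pair.items() if d in pair)

    joint = p_pair[(min(i, j), max(i, j))]
    if joint == 0.0:
        return float("-inf")
    return math.log2(joint / (marginal(i) * marginal(j)))
```

For a three-document network with links 2, -1 and $-\infty$, the recovered links agree with the chosen ones up to floating-point error and the probabilities sum to 1.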


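In order for the set of values selected for the $p(x_1 \dots x_n)$'s to be a valid set of probabilities, their sum must be 1.

$$\begin{aligned}
S &= \sum_{\text{over all } x\text{'s}} p(x_1 \dots x_n) \\
&= \frac{1}{2} \sum_{\substack{i,j=1 \\ (i \ne j)}}^{n} p(x_1 \dots x_n : x_i, x_j = 1) + \sum_{i=1}^{n} p(x_1 \dots x_n : x_i = 1) + p(x_1^0 \dots x_n^0) \\
&= \frac{k^2}{2} \sum_{\substack{i,j=1 \\ (i \ne j)}}^{n} 2^{C(x_i^1 x_j^1)} + nk - k^2 \sum_{\substack{i,j=1 \\ (i \ne j)}}^{n} 2^{C(x_i^1 x_j^1)} + 1 - nk + \frac{k^2}{2} \sum_{\substack{i,j=1 \\ (i \ne j)}}^{n} 2^{C(x_i^1 x_j^1)} \\
&= 1
\end{aligned}$$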


We must also prove that the values selected for the $p(x_1 \dots x_n)$'s are in the range 0 to 1. The values for the first class of probabilities, $p(x_1 \dots x_n : \text{at least 3 } x\text{'s} = 1)$, are all 0 and thus automatically in the range. The values assigned to the probabilities of the second class, $p(x_1 \dots x_n : x_i, x_j = 1)$, can be shown to be in the range by the following argument.

$$k \le (1/n)\,2^{-C_{\max}} \le (1/n)\,2^{-C(x_i^1 x_j^1)}$$

$$k\,2^{C(x_i^1 x_j^1)} \le (1/n) \quad \text{and} \quad k \le (1/n)$$

$$k^2\, 2^{C(x_i^1 x_j^1)} \le (1/n)^2$$

$$0 \le k^2\, 2^{C(x_i^1 x_j^1)} < 1$$

Next let us show that the values assigned to the probabilities of the third class, $p(x_1 \dots x_n : x_i = 1)$, are in the correct range.

$$k - k^2 \sum_{\substack{j=1 \\ (j \ne i)}}^{n} 2^{C(x_i^1 x_j^1)} \le k \le 1/n < 1$$

$$k - k^2 \sum_{\substack{j=1 \\ (j \ne i)}}^{n} 2^{C(x_i^1 x_j^1)} \ge k - k(n-1)(1/n) > 0$$

Finally let us check the range of $p(x_1^0 \dots x_n^0)$.

$$1 - nk + (k^2/2) \sum_{\substack{i,j=1 \\ (i \ne j)}}^{n} 2^{C(x_i^1 x_j^1)} \le 1 - nk + (1/2)(n)(n-1)(k)(1/n) = 1 - \frac{(n+1)k}{2} < 1$$

$$1 - nk + (k^2/2) \sum_{\substack{i,j=1 \\ (i \ne j)}}^{n} 2^{C(x_i^1 x_j^1)} \ge 1 - nk \ge 1 - n(1/n) = 0$$
CHAPTER IV
DOCUMENT  CLUSTERS 

In  the  last  chapter  a  measure  of  relatedness  between  documents  was 
defined  and  a  document  network  based  on  the  measure  was  described.  The 
next  step  to  be  taken  is  to  formulate  a  definition  for  what  constitutes 
a  subset  (cluster)  of  highly  inter-related  documents  based  on  this 
measure. The purpose of such a definition is to provide the user who
has requested information from the system with a set (cluster) of papers
which is judged to be related to his interest.

The  exact  form  that  a  request  for  information  can  take  and  the  pro¬ 
cedure  used  to  translate  a  request  into  an  answer  cluster  will  be  de¬ 
scribed  in  Chapter  V.  The  way  a  cluster  is  obtained,  modified,  and 
stored  in  the  experimental  system  devised  for  this  project  will  be 
covered  in  Chapter  VI.  In  this  chapter  we  shall  confine  our  attention 
to  what  constitutes  an  appropriate  cluster  of  documents.  Two  types  of 
clusters will be defined and analyzed, and certain modifications will be
described which make one of the definitions acceptable.

4.1 Local Maximum Clusters

The  cluster  definition  which  was  first  proposed  and  tested  turned 
out  to  be  the  one  which  was  eventually  selected  for  this  project.  Let 
us  formally  define  it  and  then  discuss  its  characteristics. 

In this definition and in the remainder of this thesis we will find use for the following set operators.

$\cup$: Set union--$(A \cup B)$ is the set of all documents in set A or in set B.

$\cap$: Set intersection--$(A \cap B)$ is the set of documents in both set A and set B.

$\subset$: Set inclusion--$(A \subset B)$ means that the set A is included in the set B.

$\bar{X}$: Set complementation--$\bar{X}$ is the set of all documents not in X.
Definition: Local Maximum Cluster

A local maximum cluster is defined to be any subset of documents $X_\alpha = (x_{\alpha_1}, \dots, x_{\alpha_r})$ for which both of the following conditions hold.

1. Every document in $X_\alpha$ is positively correlated to the remainder of $X_\alpha$.

$$C[x_i (X_\alpha \cap \bar{x}_i)] > 0 \quad \text{for all } x_i \subset X_\alpha .$$

2. Every document $x_j$ not in $X_\alpha$ is negatively correlated to $X_\alpha$.

$$C(x_j X_\alpha) \le 0 \quad \text{for all } x_j \subset \bar{X}_\alpha .$$

(Note that zero is arbitrarily classed as a negative value.)

A local maximum cluster is so named because every possible single change (addition or deletion) to the cluster will result in a decrease in its internal correlation. The internal correlation C(X) of a subset X is defined to be the sum of the links whose ends both terminate in the subset. If $X_\alpha$ is a cluster, then

$$C(X_\alpha) > C(X_\beta) \quad \text{for all } X_\beta \text{ which differ from } X_\alpha \text{ by a single document.}$$
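The two conditions translate directly into a membership test. The sketch below is illustrative only; the function names and the example network are assumptions made here (two triangles of +5 links sharing one document, all other links -6), not the network of any figure in this thesis. A set consisting of a single document is treated as satisfying Condition 1 trivially, an assumption consistent with the single-document clusters mentioned later in this section.

```python
def build_links(n, positive_pairs, positive=5.0, negative=-6.0):
    """Symmetric link matrix: listed pairs get `positive`, all others `negative`."""
    links = [[negative] * n for _ in range(n)]
    for i, j in positive_pairs:
        links[i][j] = links[j][i] = positive
    return links


def correlation_to_set(doc, members, links):
    """Sum of the links between `doc` and every other document in `members`."""
    return sum(links[doc][m] for m in members if m != doc)


def is_local_maximum_cluster(cluster, links):
    cluster = set(cluster)
    if not cluster:
        return False
    # Condition 1: every member is positively correlated to the remainder.
    # (Assumed to hold trivially for a single-document set.)
    if len(cluster) > 1:
        for d in cluster:
            if correlation_to_set(d, cluster, links) <= 0:
                return False
    # Condition 2: every outside document is negatively (or zero) correlated.
    for d in range(len(links)):
        if d not in cluster and correlation_to_set(d, cluster, links) > 0:
            return False
    return True


# Assumed example: documents 0-1-2 and 2-3-4 form two triangles of +5 links.
EXAMPLE_LINKS = build_links(5, [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)])
```

In this example both triangles pass the test, while the pair (0, 1) fails Condition 2 and the full set of five documents fails Condition 1.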

Five  specific  characteristics  of  local  maximum  clusters  have  been 
selected  for  discussion  below. 

Size. The average size of the clusters produced by the local maximum definition is very much a function of the correlation assigned to document pairs that have not co-occurred ($N_{ij}=0$). It has already been noted that although this correlation, K, is $-\infty$ by the formula, some finite value is more appropriate (Sec. 3.5). If K is made positive, then there will be only one cluster consisting of the total file. If K is made just slightly negative, then the clusters formed will be disjoint and consist of all documents connected by one or more paths of positive links. If K is made very negative, the only clusters will be those sets of documents wherein every document has co-occurred with every other document.
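The dependence on K can be seen on a very small network. The sketch below is an illustration under assumptions introduced here: three documents joined in a chain by two +5 links, the remaining pair (which has never co-occurred) taking the value K, a hypothetical function name, and single-document sets treated as satisfying Condition 1 trivially. It enumerates every subset, which is only feasible for tiny networks.

```python
from itertools import combinations


def local_maximum_clusters(n, positive_links, k_value):
    """List all local maximum clusters of an n-document network by enumeration.

    positive_links maps a pair (i, j) to its link value; every other pair
    is given the value k_value (the correlation assigned when N_ij = 0).
    """
    links = [[k_value] * n for _ in range(n)]
    for (i, j), value in positive_links.items():
        links[i][j] = links[j][i] = value

    def corr(doc, members):
        return sum(links[doc][m] for m in members if m != doc)

    found = []
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            inside_ok = size == 1 or all(corr(d, subset) > 0 for d in subset)
            outside_ok = all(corr(d, subset) <= 0
                             for d in range(n) if d not in subset)
            if inside_ok and outside_ok:
                found.append(set(subset))
    return found


CHAIN = {(0, 1): 5.0, (1, 2): 5.0}  # documents 0 and 2 never co-occur
```

With K = +1 or K = -1 the only cluster is the whole chain; with K = -6 the clusters are the two co-occurring pairs, in line with the three cases described above.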

Overlap. It is fairly obvious that local maximum clusters can overlap. Consider the network of Fig. 4.1 in which all the links shown have the value +5 and all the links not shown have the value -6. The two local maximum clusters, $(x_1 x_2 x_3)$ and $(x_3 x_4 x_5)$, overlap through $x_3$.

Links shown are +5.
Links not shown are -6.

Fig. 4.1. Network with overlapping clusters.

Coverage .  The  following  simple  theorem  shows  that  local  maximum 
clusters  may  not  cover  all  the  documents  in  the  network. 

Theorem.  Document  networks  exist  which  have  documents  that  are 
not  included  in  any  local  maximum  cluster. 

Proof.  First  consider  a  document  that  has  never  co-occurred  with 
any  other  document.  Such  a  document  does  not  prove  the  theorem  because 
it  is  included  in  a  cluster  which  consists  of  only  the  document  itself. 
Now consider the network of Fig. 4.2. The only cluster is $(x_2 x_3 x_4 x_5)$. The document $x_1$ cannot form a cluster by itself since $x_2$ and $x_3$ are positively correlated to it. It cannot form a cluster with $x_2$ and $x_3$ since $x_4$ and $x_5$ are positively correlated to the set $(x_1 x_2 x_3)$ with the value 5+5-6=4. Thus $x_1$ occurs in no cluster. QED

Links shown are +5.

Links not shown are -6.

Fig. 4.2. Network with a document ($x_1$) in no cluster.

Although  local  maximum  clusters  do  not  cover  all  possible  documents 
in  a  network,  one  is  at  least  assured  of  the  following — 

Theorem.  Every  document  network  contains  at  least  one 

local  maximum  cluster. 

Proof. The proof will be constructive. A local maximum cluster can be formed by successively making single changes (additions or deletions) to a subset of documents as outlined in the following 3-step procedure.

1. Pick a document at random as the initial member of the subset.

2. If every document outside the subset is negatively correlated to the subset and every document inside the subset is positively correlated to the subset, then quit. The local maximum cluster has been found.

3. Otherwise either add a positively correlated document that is not in the subset or delete a negatively correlated document that is in the subset. It doesn't matter which is done, but only one change must be made. Now return to step 2.

This  procedure  is  assured  of  termination  if  the  document  set  is 


finite  because  step  3  always  increases  the  internal  correlation  (sum  of 
the  internal  links)  of  the  subset  being  formed.  There  is,  of  course,  an 
upper  limit  to  the  internal  correlation  of  any  finite  set  of  documents. 

QED 
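The 3-step procedure of the proof can be sketched as a short routine. This is an illustration under stated assumptions rather than the procedure of the experimental system: the function name, the rule of always trying an addition before a deletion, the deterministic choice of the lowest-numbered candidate, and the example network (two triangles of +5 links sharing one document, all other links -6) are all choices made here. A subset holding a single document is never emptied.

```python
def find_local_maximum_cluster(links, start=0):
    """Grow or shrink a subset by single changes until no change applies."""
    n = len(links)
    subset = {start}                                   # step 1

    def corr(doc):
        # Correlation of `doc` to the current subset (excluding itself).
        return sum(links[doc][m] for m in subset if m != doc)

    while True:
        # Step 3a: add an outside document that is positively correlated.
        to_add = next((d for d in range(n)
                       if d not in subset and corr(d) > 0), None)
        if to_add is not None:
            subset.add(to_add)
            continue
        # Step 3b: delete an inside document that is not positively correlated.
        to_drop = next((d for d in sorted(subset)
                        if len(subset) > 1 and corr(d) <= 0), None)
        if to_drop is not None:
            subset.remove(to_drop)
            continue
        return subset                                  # step 2: nothing to change


def build_links(n, positive_pairs, positive=5.0, negative=-6.0):
    links = [[negative] * n for _ in range(n)]
    for i, j in positive_pairs:
        links[i][j] = links[j][i] = positive
    return links


# Assumed example network: triangles 0-1-2 and 2-3-4 share document 2.
EXAMPLE_LINKS = build_links(5, [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)])
```

Each addition strictly raises the internal correlation and each deletion never lowers it while shrinking the subset, so no subset is visited twice and the loop ends on a finite file.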

Structure. Local maximum clusters can form the type of hierarchal
structure indicated by the following theorem.

Theorem.  A  local  maximum  cluster  can  be  a  subset  of 
another  local  maximum  cluster. 

Proof.  Again  we  can  use  an  example  to  prove  the  theorem.  In  the 
document  network  of  Fig.  4.3  there  are  five  local  maxima: 

(x^),  (x^),  (x3x1|),  (x^),  (x1x2x3xu). 

The  first  four  of  these  are  subsets  of  the  fifth.  QED 


Links  shown  are  +5. 
Links  not  shown  are  -6. 


Fig.  4.3.  Network  with  hierarchal  cluster  structure. 


Relatedness. Now consider the problem of whether local maximum clusters form well related sets.

Theorem. Totally unrelated subsets of documents can occur together in a local maximum cluster. By totally unrelated we mean that no document in one set is positively correlated to a document in the other set.

Proof. This theorem can be proved by another simple example. The set $(x_1 x_2 x_3 x_4)$ of Fig. 4.4 forms a cluster and yet there are no positive links between the set $(x_1 x_2)$ and the set $(x_3 x_4)$. QED

Links shown are +7.

Links not shown are -3.

Fig. 4.4. Cluster containing unrelated subsets.

The inclusion of unrelated subsets in the same cluster is considered an undesirable characteristic for a cluster to have. The reason why this is so involves the design of the procedure of Chapter V. It was decided that the procedure could be greatly simplified if one were to assume that each request for information from the system has only one purpose.

A person who has several areas of interest on which he desires information is expected to make a separate request for each area. It follows that if each request has a single purpose, then the document clusters which are to answer these requests should not be divisible into unrelated subsets.

4.2 Subset Clusters

In  an  attempt  to  keep  completely  unrelated  sets  of  documents  from 
becoming  part  of  the  same  cluster,  a  definition  was  devised  based  on  the 
addition  of  subsets  or  the  deletion  of  subsets  of  documents  as  opposed 
to  the  single  changes  allowed  in  the  local  maximum  definition.  This 
definition  was  accepted  as  the  one  most  suitable  for  this  project  for  a 
number  of  months.  In  this  section  we  shall  describe  it,  note  its  charac¬ 
teristics,  and  explain  why  it  was  finally  discarded. 

Definition 1: Subset Cluster

A subset cluster is defined to be any set of documents $X_\alpha = (x_{\alpha_1}, \dots, x_{\alpha_r})$ for which both of the following conditions hold.

1. Every subset of documents $X_\beta$ included within $X_\alpha$ is positively correlated to the remainder of $X_\alpha$.

$$C[X_\beta (X_\alpha \cap \bar{X}_\beta)] > 0 \quad \text{for all } X_\beta \subset X_\alpha .$$

2. Every subset of documents $X_\beta$ external to $X_\alpha$ is negatively correlated to $X_\alpha$.

$$C(X_\beta X_\alpha) \le 0 \quad \text{for all } X_\beta \subset \bar{X}_\alpha .$$

It is worth noting that Condition 2 of the local maximum cluster definition is equivalent to Condition 2 above. If each document external to $X_\alpha$ is negatively correlated to $X_\alpha$, then certainly all external subsets are negatively correlated to $X_\alpha$. Conversely if each subset is negatively correlated to $X_\alpha$, then, of course, single documents, being subsets, are also negatively correlated to $X_\alpha$. It should also be pointed out that all subset clusters are local maximum clusters but not vice versa.
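Definition 1 can be checked by brute force on small sets. The sketch below is illustrative and rests on assumptions made here: the function names are hypothetical, Condition 1 is applied to every proper, non-empty subset, Condition 2 is checked through single external documents (which the preceding paragraph shows to be sufficient), and the example network has four documents with +7 links on the pairs (0, 1) and (2, 3) and -3 links elsewhere.

```python
from itertools import combinations


def cross_correlation(a, b, links):
    """Sum of all links running between a document in `a` and one in `b`."""
    return sum(links[i][j] for i in a for j in b)


def is_subset_cluster(cluster, links):
    members = sorted(set(cluster))
    if not members:
        return False
    # Condition 1: every proper, non-empty subset is positively correlated
    # to the remainder of the cluster (exponential in the cluster size).
    for size in range(1, len(members)):
        for part in combinations(members, size):
            rest = [m for m in members if m not in part]
            if cross_correlation(part, rest, links) <= 0:
                return False
    # Condition 2: checked document by document for everything outside.
    for d in range(len(links)):
        if d not in members and cross_correlation([d], members, links) > 0:
            return False
    return True


# Assumed example: two strongly linked pairs with weak negative links between.
EXAMPLE_LINKS = [
    [0.0, 7.0, -3.0, -3.0],
    [7.0, 0.0, -3.0, -3.0],
    [-3.0, -3.0, 0.0, 7.0],
    [-3.0, -3.0, 7.0, 0.0],
]
```

In this example all four documents together satisfy the local maximum conditions (each document has correlation 7 - 3 - 3 = 1 to the rest) but not the subset conditions, since the two pairs are correlated to each other by -12. Each pair on its own is a subset cluster.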

Next  let  us  present  an  alternative  definition  of  a  subset  cluster. 

Definition  2:  Subset  Cluster 

A subset cluster is defined to be any set of documents $X_\alpha = (x_{\alpha_1}, \dots, x_{\alpha_r})$ for which both of the following conditions hold.

1. The internal correlation of $X_\alpha$ as defined in Sec. 4.1 is greater than the sum of the internal correlations of the disjoint subsets of $X_\alpha$ created by any arbitrary partitioning.

$$C(X_\alpha) > \sum_{i=1}^{r} C(D_i) \quad \text{for all partitionings in which } (D_1 \cup \dots \cup D_r) = X_\alpha \text{ and } D_i \cap D_j = \text{null set.}$$

2. The sum of the internal correlations of $X_\alpha$ and some subset $X_\beta$ external to $X_\alpha$ is greater than or equal to the internal correlation of the set formed by adding $X_\beta$ to $X_\alpha$.

$$C(X_\alpha) + C(X_\beta) \ge C(X_\alpha \cup X_\beta) \quad \text{for all } X_\beta \subset \bar{X}_\alpha .$$

Theorem.  Definition  1  and  Definition  2  for  subset  clusters 

are  equivalent. 

Proof. The equivalence of the second conditions of both definitions is fairly obvious. The equivalence of the first conditions requires some verification.

Let us assume that Cond. 1 of Def. 2 holds and partition the cluster into two subsets.

$$C(X_\alpha) > C(X_\beta) + C(X_\alpha \cap \bar{X}_\beta)$$

But:

$$C(X_\alpha) - C(X_\beta) - C(X_\alpha \cap \bar{X}_\beta) = C[(X_\beta)(X_\alpha \cap \bar{X}_\beta)]$$

$$C[(X_\beta)(X_\alpha \cap \bar{X}_\beta)] > 0$$

This last result is Cond. 1 of Def. 1.

Now let us assume that Cond. 1 of Def. 1 holds and partition the cluster into the disjoint subsets $D_1, \dots, D_r$. By Def. 1:

$$C[(D_i)(X_\alpha \cap \bar{D}_i)] > 0 \quad \text{for all } D_1, \dots, D_r$$

But:

$$C(X_\alpha) - \sum_{i=1}^{r} C(D_i) = \frac{1}{2} \sum_{i=1}^{r} C[(D_i)(X_\alpha \cap \bar{D}_i)]$$

$$C(X_\alpha) > \sum_{i=1}^{r} C(D_i)$$

Thus if Cond. 1 of Def. 1 is true, Cond. 1 of Def. 2 is also. QED
Let us discuss now some of the characteristics of subset clusters. The comments and theorems on cluster size, overlap and coverage, which were made in Sec. 4.1 for local maximum clusters, hold for subset clusters also with the exception that one is no longer assured of having at least one cluster in any given document network.
Theorem. There exist document networks which contain no subset clusters.

Proof. Examination of each of the $2^n$ possible subsets in the network of Fig. 4.5 reveals that none of them satisfy the two conditions necessary for subset clusters. QED

Links not shown are -5.

Fig. 4.5. Network containing no subset clusters.

Structure.  Next  we  note  that  a  hierarchal  structure  is  no  longer 
possible  with  subset  clusters. 

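Theorem. No subset cluster $X_\beta$ can be included within another subset cluster $X_\alpha$.

Proof. Let us assume that $X_\alpha$ and $X_\beta$ are subset clusters and that $X_\beta \subset X_\alpha$. Since $X_\alpha$ is a cluster and $X_\beta \subset X_\alpha$, then by Cond. 1 of the definition:

$$C[X_\beta (X_\alpha \cap \bar{X}_\beta)] > 0$$

But since $X_\beta$ is a cluster and $(X_\alpha \cap \bar{X}_\beta) \subset \bar{X}_\beta$, then by Cond. 2:

$$C[X_\beta (X_\alpha \cap \bar{X}_\beta)] \le 0$$

which contradicts the previous inequality. QED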

Relatedness. In the last section it was pointed out that one of the difficulties with local maximum clusters lies in the fact that even completely uncorrelated sets of documents can occur in the same cluster. It was for this reason that the subset definition was devised. In subset clusters one is assured by definition that no subset of the cluster is negatively correlated to the remainder of the cluster.
Utility. The problems of coverage and hierarchy did not prove to be serious drawbacks to the subset definition of clusters. An extension to the definition was devised which allowed all documents to be in at least one cluster and provided for hierarchal relationships. This extension involved applying a bias to the links of the network. (See Sec. 4.4.)

The subset definition was finally abandoned because no method could be found that would isolate subset clusters with a reasonable amount of effort.

Consider for a moment the problem of checking Condition 1 of the subset definition. One must determine whether there is a partitioning of a set of documents which results in two subsets that are negatively correlated to each other. The brute force method is to try every partitioning. This would involve $2^n$ tests for a set of n documents and would certainly be too much processing for an n of 20 or 30 even on a high speed digital computer. Several efforts were made to devise a more efficient method. Although they were not entirely successful, it might be well to briefly document a couple of them.

4.3 Finding Subset Clusters

In the first method for finding subset clusters which was investigated, an effort was made to determine if a partitioning of a set existed which would result in two negatively correlated subsets. Such a partitioning is called a "split" of the set in the following discussion.

In the other approach emphasis was put on the small, highly correlated subsets within the set [remainder of sentence illegible in source].


4.31 Locating Splits


We wish to devise a method which will determine whether a set of documents can be split into two negatively correlated subsets and to locate where such splits are. Some of the theorems that were developed for this purpose will be stated below. In the interests of brevity the proofs will not be given. The symbols used in these theorems are defined as follows.

n - number of documents in S, the set under consideration.

a - number of documents in a subset A of S.

b - number of documents in a subset B where $B = S \cap \bar{A}$. ($a+b=n$, $A \cup B = S$)

K - negative value assigned to links for which $N_{ij}=0$.

$C_{\min}$ - smallest value of the links for which $N_{ij} \ne 0$. It will be assumed in the following theorems that $C_{\min}$ is positive. (See Sec. 3.5.)

$C_{\max}$ - largest positive link in the network.

d - number of links in the set S which have the value K.

Theorem 1: Consider the partitioning of a set of documents into the subsets A and B.

Part A: Only those partitionings which satisfy the following inequality can possibly result in splits.

$$(a)(b) < d\,\frac{C_{\min} + |K|}{C_{\min}}$$

Part B: A necessary condition for a partitioning to result in a split is that the partitioning must be crossed by at least r negative links where:

$$r = \frac{(a)(b)(C_{\min})}{C_{\min} + |K|}$$

Part C: A sufficient condition for a partitioning to result in a split is that the partitioning be crossed by at least s negative links where:

$$s = \frac{(a)(b)(C_{\max})}{C_{\max} + |K|}$$


d = 40 (40 of the 190 links are negative)

By Part A of the theorem (a)(b) must be less than 90 to allow a
split. Therefore partitionings with distributions a:b = 10:10, 9:11,
8:12, and 7:13 cannot possibly result in splits. This immediately
eliminates about 90% of the possible partitionings as candidates for
splitting the set. Unfortunately there are some 60,450 partitionings
that must still be considered, which remains out of the question.

However if the 40 negative links are all bunched on only 5 of the
nodes (8 per node), then by Part B of the theorem only 61 partitionings
can possibly cause splits and these can easily be checked.

If only 10% of the links are negative (19 instead of 40), then only
partitionings with a:b = 1:19 and 2:18 can cause splits. There are 210
such partitionings and a check of these would also be possible.

However in the general case $C_{min}$ may be small, d may be large, and
the negative links may not be so fortuitously arranged, so that the
number of partitionings which must be examined may still remain very
large.
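
The arithmetic of the example above can be reproduced with a short sketch. It assumes that the Part A limit has the form $ab < d(C_{min} + |K|)/C_{min}$, and it uses the illustrative values $C_{min} = 4$ and $K = -5$. These two values are not taken from the text; they were chosen only because they give the limit of 90 quoted in the example. All function names are hypothetical.

```python
from math import comb


def part_a_bound(d, c_min, k):
    """Assumed Part A limit on a*b: d * (C_min + |K|) / C_min."""
    return d * (c_min + abs(k)) / c_min


def part_b_threshold(a, b, c_min, k):
    """Part B: number of negative crossing links needed before a split is possible."""
    return a * b * c_min / (c_min + abs(k))


def part_c_threshold(a, b, c_max, k):
    """Part C: number of negative crossing links that guarantees a split."""
    return a * b * c_max / (c_max + abs(k))


def candidate_sizes(n, bound):
    """Size distributions a:b (with a <= b) whose product a*b is below the bound."""
    return [(a, n - a) for a in range(1, n // 2 + 1) if a * (n - a) < bound]


def count_partitionings(n, sizes):
    """Number of distinct two-way partitionings having one of the given sizes."""
    total = 0
    for a, b in sizes:
        ways = comb(n, a)
        if a == b:
            ways //= 2  # choosing A or its complement gives the same partitioning
        total += ways
    return total
```

With n = 20 and d = 40 the sketch keeps the distributions 1:19 through 6:14, which together cover roughly 60,000 partitionings; with d = 19 it keeps only 1:19 and 2:18, which cover 210.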

Theorem  2  is  concerned  with  the  possibility  of  finding  splits  of 
the  set  S  as  it  is  being  formed. 
Theorem 2. Consider the possibility of a set of documents
being split by the addition of another document. Three statements
can be made.

1.  If  the  new  document  is  positively  correlated  to  each  item 
in  the  set,  then  no  split  can  be  created. 

2.  If  a  split  is  created,  it  must  be  crossed  by  at  least 
one  newly  added  negative  link. 

3.  The  sum  of  the  newly  added  links  crossing  any  split 
created  must  be  negative. 

The  next  two  theorems  will  help  to  determine  whether  the  set  S  is  a 
subset  cluster  when  it  contains  one  or  more  documents  that  are  positively 
correlated  to  all  of  the  other  documents  in  S. 

Theorem  3.  If  a  set  of  n  documents  has  d  or  more  documents 
that  are  positively  linked  to  every  other  document  in  the  set, 
then  the  set  has  no  splits. 


n  |  K  | 


Theorem 4. Assume that a set of documents has splits. Now
remove all those documents that are positively correlated to
every other document in the set. The reduced set must also
have splits.

The  sum  of  the  links  connecting  documents  in  the  subset  A  to  docu¬ 
ments  in  B  is  termed  the  cross  correlation  of  the  partitioning  which 
created  A  and  B.  The  following  three  theorems  relate  to  this  cross 
correlation. 

Theorem 5. The cross correlations of all possible
partitionings of a document set are equal if and only if every link
has the value 0. ($n > 3$)

Theorem 6. The cross correlations of all possible
partitionings of a document set of size a:b are equal if and only
if every link has the same value.

Theorem 7. The average cross correlation of the
partitionings of size a:b is $C(S)\,ab/\binom{n}{2}$ where C(S) is the total
internal correlation of the set.
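
Theorem 7 can be checked by brute force on a small network. The sketch below is only an illustration: it assumes that C(S) is the sum of all links inside S, it stores links in a dictionary keyed by unordered document pairs, and every function name is hypothetical. Exact fractions are used so that the comparison does not depend on rounding.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def cross_correlation(links, subset_a, docs):
    """Sum of the links joining documents in A to documents outside A."""
    a = set(subset_a)
    return sum(links[frozenset((x, y))] for x in a for y in docs if y not in a)


def average_cross_correlation(links, docs, a):
    """Average cross correlation over every choice of a subset A of size a."""
    values = [cross_correlation(links, sub, docs) for sub in combinations(docs, a)]
    return Fraction(sum(values), len(values))


def theorem7_value(links, docs, a):
    """C(S) * a * b / (n choose 2), with C(S) taken as the sum of all links."""
    n = len(docs)
    total = sum(links.values())
    return Fraction(total * a * (n - a), comb(n, 2))
```

When a equals b each partitioning is enumerated twice (once as A and once as its complement), which leaves the average unchanged.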

4.32  Forming  Kernels

Another  method  which  was  considered  as  a  way  for  determining  if  a 
set  was  a  subset  cluster  was  to  form  highly  correlated  kernels  within 
the  set  in  question  and  thereby  try  to  locate  possible  splits.  The  ker¬ 
nels  might  initially  be  those  subsets  wherein  every  document  is  posi¬ 
tively  correlated  to  every  other  document.  These  sets  could  then  be 
combined  in  various  ways  to  see  if  any  splits  appeared.  The  following 
two  theorems  relate  to  this  approach. 

The symbols used are as defined in the last section and as follows:

$C_{avg}$ - average of the positive links of the set.

$D_i$ - ith disjoint kernel of the set S.

$\bigcup_i D_i \subset S$

$D_i \cap D_j$ = null set for all i, j ($i \neq j$).

Theorem. If the sum of the internal correlations of a set
of disjoint kernels is greater than or equal to the total
internal correlation of the set, then there is at least one
split in the set.

In other words, if:

$$\sum_{i=1}^{t} C(D_i) \geq C(S)$$

then S has at least 1 split.

Theorem. A sufficient condition for having at least one
split in a set is that the set contain at least d negative
links where:

$$d = \frac{\binom{n}{2}\,C_{avg} - \sum_{i=1}^{t} C(D_i)}{C_{avg} + |K|}$$


4.4  Biased  Clusters

In  this  section  an  extension  or  modification  to  the  cluster  defini¬ 
tions  is  proposed.  It  was  initially  devised  in  order  that  subset 
clusters could have a hierarchical structure. It was found to be a useful
modification  to  local  maximum  clusters  also. 

As  a  way  of  introducing  the  concept  of  a  biased  cluster,  let  us  con¬ 
sider  a  large  cluster  (either  local  maximum  or  subset)  of  documents 
covering  a  rather  broad  field  of  interest.  There  will,  of  course,  be 
users who want all of the documents in such a cluster, but what about
the  users  whose  interests  are  very  specific  and  who  want  only  a  small 
portion  of  the  cluster?  As  yet  there  has  been  no  provision  for  such  a 
narrowing  of  interest.  Subset  clusters  and  many  local  maximum  clusters 
are  not  decomposable.  We  shall  now  present  the  theoretical  basis  of  a 
method  which  will  allow  a  cluster  to  be  reduced  to  a  more  specific  set 
or  enlarged  to  a  more  general  set. 

Consider a set of documents, $W = (w_1, \ldots, w_r)$, which forms a cluster
in  the  overall  document  network.  The  problem  of  retrieving  a  portion  of 
this  cluster  is  regarded  as  equivalent  to  the  problem  of  finding  a 
cluster  in  the  sub-library  consisting  only  of  W. 

In order to show how this might be done let us define a new sample
space which has only $2^r$ points instead of the $2^n$ points of the original
sample space. Each point in the new space represents a possible
partitioning of W. To distinguish between the probabilities of the two
sample spaces, the probabilities of the old sample space will be given
a subscript 'a' and the probabilities of the new sample space a
subscript 'P'. Let the probabilities assigned to the points of this new
sample space be initially equal to the marginal probabilities of the
corresponding events over the old sample space.


$$p_P(w_1 \ldots w_r) = p_a(w_1 \ldots w_r) = \sum_{\text{over all } x \text{ not in } W} p_a(x_1 \ldots x_n)$$


The marginal probability, $p_P(w_1^0 \ldots w_r^0)$, is the sum of the
probabilities of all those elementary events in which none of the documents in W
are in the subset of interest. Since these events are irrelevant when
one is considering only the sub-library W, let us set $p_P(w_1^0 \ldots w_r^0)$ equal
to 0. Such a step requires that the other $p_P(w_1 \ldots w_r)$'s all be increased
by a normalizing factor k. The final values for the probabilities
assigned to the new sample space can now be specified.

$$p_P(w_1^0 \ldots w_r^0) = 0$$

$$p_P(w_1 \ldots w_r) = k\,p_a(w_1 \ldots w_r) \quad \text{for all } p_P(w_1 \ldots w_r) \text{ except } p_P(w_1^0 \ldots w_r^0)$$

$$k = 1/[1 - p_a(w_1^0 \ldots w_r^0)]$$

Now let us consider the effect of this change in the sample space
on the correlation of any two documents in W.

$$C_a(w_1^1 w_2^1) = \log \frac{p_a(w_1^1 w_2^1)}{p_a(w_1^1)\,p_a(w_2^1)}$$

$$C_P(w_1^1 w_2^1) = \log \frac{p_P(w_1^1 w_2^1)}{p_P(w_1^1)\,p_P(w_2^1)} = \log \frac{k\,p_a(w_1^1 w_2^1)}{k\,p_a(w_1^1)\,k\,p_a(w_2^1)} = \log \frac{p_a(w_1^1 w_2^1)}{p_a(w_1^1)\,p_a(w_2^1)} - \log k$$

$$C_P(w_1^1 w_2^1) = C_a(w_1^1 w_2^1) - \log k$$

Thus  the  correlations  for  the  sub-library  can  be  obtained  by  merely 
subtracting  a  constant  or  bias  from  the  correlations  for  the  full  library. 

An  alternative  way  to  describe  this  approach  is  through  the  frequency 
counts  used  in  making  the  probability  estimates.  Instead  of  considering 
all  the  available  partitionings  of  the  document  file,  let  us  consider 
only  those  partitionings  in  which  one  or  more  of  the  documents  in  W  occur 
in  the  subset  of  interest.  Let  us  denote  the  counts  based  on  this  re¬ 
stricted  set  of  partitionings  by  the  letter  M  and  use  N  for  the  original 
counts. 


$N_i = M_i$ for all i in W.
$N_{ij} = M_{ij}$ for all i, j in W.


Now  let  us  consider  what  happens  to  the  approximation  to  C  based  on 
the  probability  estimates  with  the  new  frequency  counts. 


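$$\hat{C}_P(w_i^1 w_j^1) = \log \frac{M_{ij}/M}{(M_i/M)(M_j/M)} = \log \frac{M_{ij}\,M}{M_i M_j} = \log \frac{N_{ij}\,M}{N_i N_j} = \log \frac{N_{ij}\,N}{N_i N_j} - \log \frac{N}{M}$$

$$\hat{C}_P(w_i^1 w_j^1) = \hat{C}_a(w_i^1 w_j^1) - \log (N/M)$$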


Here again we note that we can in effect reduce the size of the
library under consideration by merely subtracting a constant from each
correlation value.
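
The effect of restricting the partitioning file can be seen numerically. The sketch below is a hedged illustration, not the thesis program: it represents each partitioning by its subset of interest, assumes the estimate $\log (N_{ij} N / N_i N_j)$ with logarithms to base 2, and uses hypothetical function names. Pairs that never co-occur return `None` here, since the text assigns them the separate value K.

```python
from math import log2


def counts(partitionings, i, j):
    """Return (N, N_i, N_j, N_ij) for documents i and j."""
    n = len(partitionings)
    n_i = sum(1 for p in partitionings if i in p)
    n_j = sum(1 for p in partitionings if j in p)
    n_ij = sum(1 for p in partitionings if i in p and j in p)
    return n, n_i, n_j, n_ij


def estimated_correlation(partitionings, i, j):
    """Estimated correlation log2(N_ij * N / (N_i * N_j)); None if N_ij is 0."""
    n, n_i, n_j, n_ij = counts(partitionings, i, j)
    if n_ij == 0:
        return None
    return log2(n_ij * n / (n_i * n_j))


def restrict(partitionings, w):
    """Keep only the partitionings whose subset of interest contains a document of W."""
    return [p for p in partitionings if p & w]
```

For any pair of documents in W that co-occur at least once, the estimate over the restricted file should differ from the estimate over the full file by exactly log2(N/M).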

In an analogous manner we can increase the size of the library and
thereby obtain larger, more general clusters by adding some bias to each
correlation in the network.

We now observe that of the three measures which meet the criteria
outlined in Sec. 3.2 (3, 4, and 8) only Measure 8 allows this type of
narrowing and broadening of the request range. Measures 3 and 4 are
insensitive to any change in the size of the library or partitioning file.

One final question arises concerning the biasing of the value K
assigned to links for which $N_{ij} = 0$. One could either let the bias affect
all links equally or one could look upon K as a fixed value which is not
changed by the bias. The latter approach was rather arbitrarily
selected.

We  are  now  ready  to  define  what  is  meant  by  a  biased  cluster. 

Definition:  Biased  Cluster 

A biased local maximum cluster has the same definition as
a regular local maximum cluster, but a non-zero bias has been
applied to the document network in which the cluster is formed.
The same is true of a biased subset cluster.

In  summary,  a  simple,  easy-to-use  method  has  been  suggested  which 
will  allow  the  size  of  clusters  to  be  increased  or  decreased.  Some 
arguments  have  been  presented  which  show  that  the  method  has  a  sound 
theoretical  basis. 

Final  Cluster  Decision 

The local maximum definition of clusters was reconsidered after no
general method for finding subset clusters was found. It was pointed
out in Sec. 4.1 that local maximum clusters were considered unacceptable
because totally unrelated subsets of documents could be part of the
same cluster. The following theorem and lemmas show that this
difficulty can be avoided by selecting an appropriate value for K.

During the remainder of this section it will be assumed that all of
the links for which $N_{ij} \neq 0$ are positive (See Sec. 3.5). If this
condition does not hold then the theorems and lemmas which follow can be
restated in terms of links for which $N_{ij} \neq 0$ and links for which
$N_{ij} = 0$ instead of positive and negative links.

Theorem. Each document in a local maximum cluster of n
documents is positively linked to over half of the remaining
n-1 documents if $K \leq -C_{max}$.

Proof. By definition each document in a local maximum cluster is
positively correlated to the remaining (n-1) documents in the cluster.
Now if the positive links are smaller than or equal in magnitude to the
negative links, then it stands to reason that there must be more of the
former to yield a positive sum.

Lemma. Consider a local maximum cluster that is
partitioned into 2 subsets, $X_a$ and $X_b$, with $X_b$ the larger if they
differ in size. If $K \leq -C_{max}$, every document in $X_a$ has at
least one positive link to the other subset.

Lemma. In a local maximum cluster with $K \leq -C_{max}$ there
can be no subset that is totally uncorrelated (has no positive
links) to the remainder of the cluster.

The choice of $K \leq -C_{max}$ does not insure that a local maximum cluster
will be free of splits and thus be a subset cluster. Subsets can still
be negatively correlated to the remainder of the cluster. But it does
insure that the rather strong type of relatedness expressed by the above
two lemmas will exist for each partitioning of a local maximum cluster.

Another advantage to choosing $K \leq -C_{max}$ is that it provides the
system with a very simple test of whether two documents can be in the
same local maximum cluster.

Theorem. If $K \leq -C_{max}$ then two negatively linked documents
can occur in a local maximum cluster together only if they are
positively linked to at least one common document.

Proof. Consider a local maximum cluster of n documents. Assume
that there are two negatively correlated documents, $x_i$ and $x_j$, in the
cluster. By the previous theorem $x_i$ must be positively correlated to
over half of the (n-1) other documents in the cluster. Since $x_i$ is not
positively correlated to $x_j$ it must be positively correlated to more
than half of the remaining (n-2) documents. This is true of $x_j$ also.
Thus they must be positively correlated to at least one common document.
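
The test suggested by this theorem can be sketched as follows. The sketch assumes that links are stored in a dictionary keyed by unordered document pairs and that any pair missing from the dictionary carries the value K; the function name is hypothetical. The test is only a necessary condition, so a `True` result does not guarantee that a common cluster exists.

```python
def may_share_cluster(links, docs, x_i, x_j, k):
    """Necessary condition (for K <= -C_max) for x_i and x_j to share a cluster."""

    def link(a, b):
        # Pairs that never co-occurred are absent from `links` and get the value K.
        return links.get(frozenset((a, b)), k)

    if link(x_i, x_j) > 0:
        return True  # positively linked pairs are not excluded by the theorem
    # Negatively linked pairs need at least one common positively linked document.
    return any(
        link(x_i, y) > 0 and link(x_j, y) > 0
        for y in docs
        if y not in (x_i, x_j)
    )
```

Because the check looks only at the links of the two documents involved, it can be applied before any cluster has been formed.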

Next let us consider what value should be assigned to K to insure
that $K \leq -C_{max}$. In Sec. 3.5 it was shown that the largest value that the
estimated correlation can possibly take is (log N) where N is the number
of available partitionings of the document file. Thus if we make K equal
to (-log N) we will be assured that $K \leq -C_{max}$.

So far some reasons have been given indicating that it might be
expedient from a practical standpoint to make K equal to (-log N). Let
us now consider whether this value for K is justifiable theoretically.

It was noted in Sec. 3.5 that if the frequency counts are based on
a finite number (N) of partitionings, then none of the probability
estimates can fall between 0 and 1/N. This results in those correlations
which might have been in the range $-\infty$ to (2 - log N) being estimated to
be $-\infty$ (or perhaps some value greater than (2 - log N)). It was suggested
that those correlation estimates that are $-\infty$ by the formula might be
more appropriately adjusted to some finite negative value, K, since a
correlation of $-\infty$ implies that there is absolutely no chance of the two
documents ever occurring together.

Thus K can be considered an approximation to the correlations in the
range $-\infty$ to (2 - log N) and it would seem appropriate that it assume some
value within that range. Consider also what value K should assume as N
approaches $\infty$. It is suggested that K should approach $-\infty$ as N
approaches $\infty$ since those document pairs for which $N_{ij}$ still equals 0 in
the limit do in fact never occur together and $C(x_i x_j)$ should be $-\infty$.

There are two other consequences to making K = -log N that should be
noted. It gives the correlation a symmetric range about 0 (-log N to
log N). It also forces the correlation of documents that have never
occurred together to always be less than the correlation of documents
that have co-occurred [(-log N) < (2 - log N)].
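
These consequences can be illustrated with a small sketch. It assumes the estimate $\log (N_{ij} N / N_i N_j)$ and takes logarithms to base 2, an assumption suggested by the lower limit being written as (2 - log N). The function name is hypothetical.

```python
from math import log2


def estimated_correlation(n_ij, n_i, n_j, n):
    """Estimated correlation of two documents from frequency counts.

    n    : number of available partitionings (N)
    n_i  : partitionings containing document i
    n_j  : partitionings containing document j
    n_ij : partitionings containing both
    Pairs that never co-occur receive the fixed value K = -log2(N).
    """
    if n_ij == 0:
        return -log2(n)
    return log2(n_ij * n / (n_i * n_j))
```

With N = 1024, two documents that each occur once, and together, reach the upper limit of 10. A pair that co-occurs only once while each document occurs in half of the partitionings gives -8, which is 2 - log N. A pair that never co-occurs receives K = -10.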

The local maximum definition is therefore selected for use in this
project. Its definition is extended to include biased clusters and it
is required that K = -log N. Hereafter we will refer to a local maximum
cluster as just a cluster.
CHAPTER  V
SEARCH  PROCEDURE

The last component of the theoretical model is the procedure which
transforms a request for information into the set of documents that
comprise the answer. The first step in describing the procedure will be to
make a number of definitions. Then a list of features that a suitable
procedure should have will be given. Finally the particular procedure
developed for this project will be described and analyzed.

5.1  Definitions 

Definition:  Request 

A request for information from the system is defined to
consist of two subsets of documents. One subset, $Y = (y_1, y_2, \ldots)$,
contains those papers known by the user to be pertinent to the
current search. The other, $Z = (z_1, z_2, \ldots)$, contains those papers
that are known to be not pertinent. The Y subset must be
non-empty but the Z subset can be empty.

Definition:  Answer 

An answer to a request is defined to be a cluster of
documents which includes the Y subset of the request and
excludes the Z subset.

Definition:  Clustering  Procedure 

Any algorithm which transforms a request into an answer
will be termed a clustering procedure (sometimes hereafter just
called a procedure). We will consider for this project only
clustering procedures which are iterative in nature and which
on each iteration change the contents of a certain set of
documents, $S = (s_1, \ldots, s_u)$. Upon termination of the procedure S is
to be the answer set. For most of the procedures considered
here only a single change is made to S on each iteration. The
S generated by the ith iteration can be distinguished by a
subscript ($S_i$).

Definition :  Convergent  Procedure 

A  convergent  procedure  is  one  that  terminates  after  a 
finite  number  of  iterations. 

Definition:  Inconsistent  Request 

A  request  is  said  to  be  inconsistent  if  there  is  no  answer 
cluster  for  any  bias  which  satisfies  the  request. 

Definition:  Ambiguous  Request 

A  request  is  said  to  be  ambiguous  if  there  is  more  than 
one  answer  cluster  which  satisfies  the  request.  Note  that  one 
must  consider  all  possible  biases  in  determining  ambiguity. 
Requests  with  empty  Z  sets  will  generally  be  ambiguous.  This  is 
because  larger  and  larger  answer  clusters  can  be  formed  by  increasing 
the  bias.  For  example,  the  request  of  Fig.  5.1  is  ambiguous  having  the 
following  four  possible  answers. 


Answer                      Bias

$(y_1)$                     $-\infty \to -4$
$(y_1 x_1)$                 $-4 \to -3$
$(y_1 x_1 x_2)$             $-3 \to +7$
$(y_1 x_1 x_2 x_3)$         $+7 \to +\infty$

[Figure: document network for the request. Links not shown are -5.]

$Y = (y_1)$

$Z = (\ )$

Fig. 5.1. Ambiguous Request.

5.2  Attributes  of  a  Good  Clustering  Procedure

In  this  section  we  shall  list  some  characteristics  which  the 
clustering  procedure  should  have.  It  will  be  assumed  that  the  definition 
of a cluster of documents as given in Chapter 4 is suitable. If this is
the  case,  then  the  basic  objective  of  a  clustering  procedure  would  be  to 
locate  the  appropriate  cluster  in  an  efficient  way. 

1.  Request  Satisfaction 

If  the  request  is  unambiguous  and  consistent,  then  the  procedure 
should  produce  the  one  cluster  which  satisfies  the  request. 

2.  Request  Modification 

If the request is ambiguous or inconsistent, then the procedure should
be able to recognize this fact and should help the user to modify his
request. This suggests that the procedure should allow close
man-machine coupling so that information generated by the clustering process
can be presented to the user for his examination and modifications to the
request can be fed back into the system.

3.  Convergence

The procedure should be convergent for every possible request and
document network. Whether it is forming an answer cluster or determining
request ambiguity or inconsistency, it should never fall into a
repetitive, non-terminating cycle.

4.  Minimal  Number  of  Iterations

The  procedure  should  find  the  answer  in  as  few  iterations  as 
possible.  An  excessively  large  number  of  deletions  of  previously  added 
documents  from  the  set  being  formed  would  be  undesirable. 

5.3  Description  of  Procedure

A description and flow chart of the procedure developed for this
project will be presented in this section. An analysis of the procedure
will be given in Sec. 5.5.

Fig.  5.2  is  a  block  diagram  showing  the  overall  structure  of  the 
procedure.  Before  attempting  to  describe  each  block  in  Fig.  5.2  in 
detail  let  us  make  some  general  comments  about  the  procedure. 

There  are  three  basic  phases  which  the  procedure  can  enter  depending 
on  the  amount  of  bias  required  and  the  relationships  of  various  documents 
and  sets  of  documents. 

Phase I: No Bias

The procedure starts in this phase, remains in it as long as no bias
is required, and returns to it from Phase II if at some point the bias
can be reduced to zero. The documents considered for addition to S in
this phase are those (positive to S) which keep each $y_i$ in Y positive to
S (or at least increase its correlation to S) and keep each $z_i$ in Z
negative to S (or at least decrease its correlation to S). Of these
candidates the one with the highest correlation to S is selected for
addition to S. If at some point there are no more documents that are
positive to S, then the procedure terminates. If there are documents
that are positive to S but none of them meet the above conditions with
respect to Y and Z, then it is concluded that some bias will be needed
and Phase II is entered.

Phase II: Bias

In Phase II the bias is either made positive enough to keep all the
$y_i$'s positive to S or made negative enough to keep all the $z_i$'s negative
to S. On each iteration those documents that are positive to S by the
current bias are considered for addition to S. Of these candidates the
document which requires the least bias when added to S is selected for
addition to S. If at any time the bias becomes zero the procedure
returns to Phase I.

When there are no more documents that are positive to S, the
procedure either terminates or enters Phase III. Actually certain constraints
are placed on the amount the bias can change on any one iteration. This
means that all of the request documents may not be properly correlated to
S ($y_i$'s positive to S and $z_i$'s negative to S) at the end of Phase II.
If they are all properly correlated to S (i.e. the request is satisfied),
the procedure terminates. If they are not yet properly correlated to S,
the procedure enters Phase III.

Phase III: Monotonic Bias

The purpose of this phase is to either make positive to S certain $y_i$'s
that are not currently positive to S or to make negative to S certain
$z_i$'s that are not currently negative to S. This is accomplished by allowing the
bias to move in only one direction while suitable additions and/or
deletions are made to S. One may not return to Phase I or II from Phase
III. Phase III and the procedure terminate when the $y_i$'s and $z_i$'s are
correctly linked to S.


The  detailed  flow  charts  for  the  general  blocks  of  Fig.  5.2  will  be 

greatly  simplified  if  we  first  define  a  number  of  symbols. 

Flow Chart Symbol Definitions

$\emptyset$: The null set.

$\cap$: Set intersection operator.

$\cup$: Set union operator.

$\bar{S}$: Set of all documents not in set S. (Complement)

$\subset$: Set inclusion: $A \subset B$ means set A is included in set B.

Y: The set of all documents specified as interesting by the user.

Z: The set of all documents specified as not interesting by the user.

S: The set which is being formed into the answer cluster by the
procedure. ($Y \subset S$)

P: The set of all documents positively correlated to the set S by the
current bias. A document in S is in P if it is positively
correlated to the remainder of S.

Q: The set of documents included in P but not in S or Z. The document
to be added to S will be chosen from this set. $Q = P \cap \bar{S} \cap \bar{Z}$

T: The set consisting of those documents in Q which will not require
positive bias if added to S. Document $t_i$ is in T if when it
is added to S it will do one or both of the following
operations for every document $y_j$ in Y.

(1) Keep $y_j$ positive to the new S. $C[y_j(S \cup t_i)] > 0$
(with 0 bias)

(2) Increase the correlation of $y_j$ to S. $C(y_j t_i) > 0$
(with 0 bias)

V: The set consisting of those documents in Q which will not require a
negative bias if added to S. Document $v_i$ is in V if when it
is added to S it will do one or both of the following
operations for every document $z_j$ in Z.

(1) Keep $z_j$ negative to the new S. $C[z_j(S \cup v_i)] \leq 0$
(with 0 bias)

(2) Decrease the correlation of $z_j$ to S. $C(z_j v_i) \leq 0$
(with 0 bias)

X: The set of documents which are candidates for addition to S. If
there are one or more documents in Q that require no bias if
added to S, then X contains those documents. Otherwise it
contains the documents that require a change in bias in only
one direction.

W: The set of documents which are candidates for deletion from S. A
document $w_i$ is in W if it is negatively correlated to the
remainder of S by the current bias and if it is not included
in Y.

$C[w_i(S \cap \bar{w}_i)] \leq 0$, $w_i \in S \cap \bar{Y}$

f: Number of positive links in the set S. (with no bias)

$g_i$: Number of positive links from document $x_i$ to S. (with no bias)

$d_i$: Bias required for the set $(S \cup x_i)$. If $x_i \in T \cap \bar{V}$ then $d_i$ is just
negative enough to keep each $z_j$ negative to $(S \cup x_i)$. If
$x_i \in V \cap \bar{T}$ then $d_i$ is just positive enough to keep each $y_j$
positive to $(S \cup x_i)$. If $x_i \in T \cap V$ then $d_i$ is made 0.

BIAS: Current bias.

$b_i$: Allowable change in bias if $x_i$ is added to S.

$b_i = \text{minimum}\,((d_i - BIAS),\ 1,\ 10/(f + g_i),\ C(x_i S)/(f + g_i))$

(C above is by current bias.)

R: The set of documents in X that would keep the bias at 0 or allow it
to be reduced to 0 if added to S.

$|BIAS + b_i| = 0$ for all $x_i \in R$
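
The sets P, Q, T, V and W defined above can be computed directly from the link values. The sketch below is a minimal illustration under several assumptions that are not stated in this section: the correlation of a document to a set is taken to be the sum of its links to the members of the set, the bias is added to every link except those carrying the fixed value K, "positive" means greater than 0, and "negative" means less than or equal to 0. Links are stored in a dictionary keyed by unordered pairs, and all function names are hypothetical.

```python
def link(links, x, y, k, bias=0.0):
    """Link between x and y; pairs absent from `links` keep the fixed value K."""
    c = links.get(frozenset((x, y)))
    return k if c is None else c + bias


def corr_to_set(links, x, group, k, bias=0.0):
    """Assumed set correlation: sum of the links from x to the other members."""
    return sum(link(links, x, y, k, bias) for y in group if y != x)


def candidate_sets(links, docs, s, y_set, z_set, k, bias=0.0):
    """Return the sets (P, Q, T, V, W) for the current S and bias."""
    p = {x for x in docs if corr_to_set(links, x, s, k, bias) > 0}
    q = p - s - z_set
    # T and V are evaluated with 0 bias, as in the definitions.
    t = {x for x in q
         if all(corr_to_set(links, y, s | {x}, k) > 0 or link(links, y, x, k) > 0
                for y in y_set)}
    v = {x for x in q
         if all(corr_to_set(links, z, s | {x}, k) <= 0 or link(links, z, x, k) <= 0
                for z in z_set)}
    w = {x for x in s - y_set if corr_to_set(links, x, s, k, bias) <= 0}
    return p, q, t, v, w
```

In Phase I the documents eligible for addition would be those in both T and V, and the one with the highest correlation to S would be chosen.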

We are now ready to present more detailed flow charts for the
blocks of Fig. 5.2. Fig. 5.3 covers block 1, Fig. 5.4 covers blocks 2
and 3, Fig. 5.5 covers blocks 4 and 5, and Fig. 5.6 covers blocks 6-9.
A brief comment is made to the right of each step in these detailed flow
charts as an aid to understanding them. More precise statements of
their functions are given in Sec. 5.5.

5.4  Earlier  Procedures

For  historical  purposes  and  for  comparison  and  analysis,  let  us 
briefly  document  some  of  the  earlier  procedures  which  were  considered. 
Procedure  1 

Briefly this procedure transforms a request into three subsets:

A: the set of documents related to the request.

B: the set of some of the documents not related to the
request.

C: a 'limbo' set of documents positively correlated to both
sets A and B.

Initially set A contains only those documents specified as
interesting by the user, and set B contains those documents
specified as non-interesting. On each iteration all documents positively
(negatively) linked to A(B) and negatively (positively) linked to
B(A) are added to A(B). Documents positively linked to both A and
B are placed in limbo while those negatively linked to both are
ignored. All changes to the sets A, B, and C are made concurrently
at the end of each iteration.
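
Procedure 1 can be sketched as follows. Several details are assumptions made only for this illustration: a document is "linked" to a set by the sum of its links to the members of the set, a sum of 0 (including the case of an empty set) is treated as not positive, the limbo set is recomputed on every iteration, and the procedure stops when an iteration adds nothing to A or B. The names are hypothetical.

```python
def corr_to_set(links, x, group, k):
    """Assumed set correlation: sum of links from x to the members of the group."""
    return sum(links.get(frozenset((x, y)), k) for y in group if y != x)


def procedure_1(links, docs, interesting, not_interesting, k, max_iterations=100):
    """Grow A (related), B (not related) and C (limbo) from a request."""
    a, b, limbo = set(interesting), set(not_interesting), set()
    for _ in range(max_iterations):
        add_a, add_b, limbo = set(), set(), set()
        for x in docs - a - b:
            to_a = corr_to_set(links, x, a, k) > 0
            to_b = corr_to_set(links, x, b, k) > 0
            if to_a and to_b:
                limbo.add(x)      # positive to both sets
            elif to_a:
                add_a.add(x)      # positive to A, not positive to B
            elif to_b:
                add_b.add(x)      # positive to B, not positive to A
            # documents positive to neither set are ignored
        if not add_a and not add_b:
            break
        # All changes are applied together at the end of the iteration.
        a |= add_a
        b |= add_b
    return a, b, limbo
```

Because every document found on an iteration is added at once, a single pass can enlarge A and B by many documents, in contrast to procedures that change the sets one document at a time.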
1. Allow user to specify initial Y and Z sets.

2. Put the interesting documents in S.

3. Indicate that the procedure is not yet in the third phase.

4. Start with an initial bias of 0.

Fig. 5.3. Initialization


5. Check if there are documents in S that are negative to the remainder
of S.

6. Point at which information can flow between the user and the system
(e.g. status of clustering procedure, data on particular documents,
modifications to the request, etc.).

7. Delete a document from S.

Fig. 5.4. Condition 1 and Deletions.


8. Check if there are any more documents positive to S.

9. Check if there are documents positive to S that keep (or try to keep)
all the y's positive and all the z's negative.

10. Check if there are documents which require a change in bias in only
one direction. Note that $T \cup V = (T \cap \bar{V}) \cup (V \cap \bar{T})$ at this point.

11. Load the set X with the candidates for addition to S.

12. Check if one or more documents in X can allow the bias to drop to zero.

13. Point at which information can flow between the user and the system
(e.g. status of clustering procedure, data on particular documents,
modifications to the request, etc.).

14. Add a document to S. The document added is the $x_i$ in R for which
$C(x_i S)$ is a maximum. (Based on current bias.)

15. Add a document to S. The document added is the $x_i$ in X for which the
magnitude of the allowable new bias, $|BIAS + b_i|$, is a minimum.

16. Change the bias if necessary. (Sign of $b_i$ is modified by PHASE III to
allow change in one direction only.)

Fig. 5.5. Condition 2 and Additions.
Tests for Request Documents in Trouble

17. Check if all the documents in Y are positive to S.

18. Check if all the documents in Z are negative to S.

19. Termination of procedure. The answer cluster is S.

Phase III Bias Change

20. Check if this is the first time through Phase III.

21. Set PHASE III switch to allow bias to change in only one
    direction.

22. Make maximum change in bias. (The sign depends on the
    Phase III switch.)

Inconsistent Request

23. The request is considered inconsistent since the bias must go up
    and down simultaneously. The user is informed of this fact and
    allowed to ask questions and/or modify the request.

24. A document is chosen for deletion from Z if the user has not
    already modified the request.


Fig. 5.6. Phase III and other tests.




Procedure  2 

This procedure is the same as Procedure 1 except that only one
change is made to set A or set B at a time. Thus, the most positively
correlated document is added and then the most negative document is
deleted from each set.

Procedure  3 

The basic difference between this procedure and Procedure 2 is
that the criterion used to determine which document to add to set A
or B is that it be most positively related to the original request
instead of the current trial subset (S). Only those documents that
are positively correlated to S are considered for addition. Within
this set, selection is on the basis of correlation to the original
request.

Procedure 4

This procedure attempts to combine the advantages of Procedures
1 and 2. All documents positively correlated to either set A or B
(but not both) should be added to them on the first iteration as in
Procedure 1. Subsequently only single changes are made to the
subsets as in Procedure 2.

Let us briefly note here why these earlier procedures were rejected.
All of these procedures have a single subset B into which the documents
considered not pertinent to the search are placed. This subset is
treated just like the subset of pertinent documents and an attempt is
made to form it into a cluster also.

The difficulty with such an approach can be seen by the example of
Fig. 5.7. By the above procedures the non-pertinent set B is initialized
with Z=(z1z2); further additions to B are not possible because x, and Xj
are both negative to B. This is because the non-pertinent set is
really  not  one  cluster  but  two  clusters.  Since  and  x^  are  negative 
to  B,  one  of  them  can  be  added  to  A.  This  will  make  x^  and  negative 
to  A  and  divert  the  procedure  from  the  desired  cluster.  Basically  what 
has  happened  is  that  the  usefulness  of  the  documents  in  Z  has  been 
hindered  by  requiring  that  they  form  a  single  cluster. 

Links shown are +5
Links not shown are -6

Z = (z1z2)

Fig. 5.7. Example showing why non-pertinent documents
should not all be grouped into one cluster.

This  would  lead  one  to  suggest  that  perhaps  a  separate  cluster 
should  be  formed  around  each  document  in  Z.  There  are  some  reasons  why 
this  would  not  prove  useful  in  addition  to  the  fact  that  it  would  eat  up 
an  excessive  amount  of  effort  in  the  formation  of  non-pertinent  clusters. 
Consider  the  example  of  Fig.  5.8.  Let  us  assume  that  x^  is  added  to  A 
and  x^  to  B  on  the  first  iteration.  Now  on  the  second  iteration  x^  can 
be  added  to  A  because  it  is  no  longer  positive  to  B.  The  cluster 
(Xl^yi) is not found because the non-pertinent cluster formed
around z^ was (z^x^x^) instead of the cluster that was to be
excluded. The point here is that

the  z^'s  will  be  in  a  number  of  clusters  and  one  does  not  know  exactly 
which  cluster  to  form  around  in  order  to  divert  S  in  another  direction. 
Z = (z1)

Desired cluster: (y^x^x^)
Cluster to be excluded by z1:

Links shown are +5
Links not shown are -6

Fig. 5.8. Example of difficulty with forming clusters
around non-pertinent documents.


5.5  Analysis  of  Procedure 

Thus far the clustering procedure selected has been described and
flow  charted  and  a  brief  explanation  of  the  purpose  of  each  block  has 
been  given.  Also  certain  earlier  procedures  have  been  briefly  sketched. 
We  shall  now  analyze  the  effectiveness  of  the  selected  procedure  in 
terms  of  the  objectives  of  Sec.  5.2. 

5.51 Request Satisfaction

The  procedure  selected  and  most  of  the  other  procedures  considered 
to  date  operate  by  making  single  changes  to  a  set  S  which  initially  con¬ 
tains  the  Y  set  of  the  request.  Documents  not  in  S  that  are  positively 
correlated  to  S  are  considered  for  addition  to  S  and  documents  in  S  that 
are  negative  to  S  are  considered  for  deletion  from  S.  Let  us  first 
settle  the  question  of  whether  it  is  possible  in  general  for  a  procedure 
of  this  type  to  locate  an  answer  cluster  if  one  exists. 

Theorem. It is always possible to transform a set S which
initially contains only the Y set of the request into a (subset)
answer cluster if one exists by successively adding to S
documents that are positively correlated to S.

Proof.  The  proof  of  this  theorem  will  be  constructive. 

(1)  Initialize  the  set  S  with  Y. 

(2)  If  S  coincides  with  the  answer  cluster  A,  the  procedure 
can  terminate. 

(3)  Otherwise, consider the set of documents (A∩S̄) yet to
be added to S to form A. By the definition of a subset cluster in
Sec. 4.2, (A∩S̄) must be positively correlated to S and thus there is
at least one document in (A∩S̄) that is positively correlated to S. Add
this document to S and go back to Step (2). QED
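
The constructive argument can be traced on a small network. The following is a minimal sketch only, not the procedure of this chapter. It assumes that the correlation C(x S) of a document to a set is the sum of its link values into S (the measure itself is defined in Chapter IV). The names `link`, `corr` and `grow_to_cluster` and the +5/-6 link values are hypothetical.

```python
from itertools import combinations

def link(a, b, links, default=-6):
    """Link value between two documents; pairs not listed get `default`."""
    return links.get(frozenset((a, b)), default)

def corr(x, S, links, default=-6):
    """Assumed correlation of x to S: the sum of its link values into S."""
    return sum(link(x, s, links, default) for s in S)

def grow_to_cluster(Y, A, links, default=-6):
    """Steps (1)-(3) of the proof: start from Y and repeatedly add a
    member of A, not yet in S, that is positively correlated to S."""
    S = set(Y)
    order = []
    while S != set(A):
        remaining = [x for x in sorted(A) if x not in S]
        positive = [x for x in remaining if corr(x, S, links, default) > 0]
        if not positive:
            raise ValueError("A is not a subset cluster under this model")
        S.add(positive[0])
        order.append(positive[0])
    return S, order

# Hypothetical network: every pair inside A is linked at +5.
A = {"y1", "x1", "x2"}
links = {frozenset(pair): 5 for pair in combinations(sorted(A), 2)}
S, order = grow_to_cluster({"y1"}, A, links)
```

The sketch is told the answer cluster A in advance, exactly as the proof is. It shows that a path of positive additions exists; it does not show how to find that path when A is unknown.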

Note that this theorem is true only for subset clusters. We can
show that it does not hold for local maximum clusters by the example of
Fig. 5.9. The set (y^x^x^) forms a local maximum cluster, but it cannot
be reached from the set S0=(y^) by the addition of documents positively
correlated to S.


Links not shown are -5


Fig. 5.9. Local maximum cluster not accessible to procedure.


Even when  the theorem still does not hold for local maximum
clusters. In the network of Fig. 5.10 the set  again forms
a local maximum cluster, but it cannot be reached from the set S0=(y1y2)
by the addition of positively correlated documents.


Links shown are +4
Links not shown are -5


Fig.  5.10.  Local  maximum  cluster  not  accessible  to  procedure 
Actually it may be a distinct advantage if procedures of the type
being considered cannot reach certain local maximum clusters. It was
noted in Sec. 4.5 that a procedure which produces subset clusters only
would be preferred over one that results in local maximum clusters, but
that such a procedure had not been found. The above theorem and comments
show that procedures of the type selected can generate all of the
subset clusters which satisfy a given request. In addition they may
locate some (but not all) of the additional local maximum clusters
which satisfy the request.

Let us now observe that we have so far only proved that a suitable
clustering procedure of the type suggested may exist. The constructive
proof of the theorem does not indicate how to choose the correct
document to add to S in Step (3) if several documents are positive to S.

One  could,  of  course,  try  all  possibilities.  Let  us  represent  these 
possible  additions  by  a  tree  where  each  branch  out  of  a  node  represents 
the  addition  of  a  positively  correlated  document  to  S.  In  the  example  of 
Fig. 5.11 there are three documents positively correlated to y^, two
positively  correlated  to  the  set  (y^x^),  etc. 


A procedure which traversed all of the branches of such a tree
would be assured by the preceding theorem of finding an answer (subset)
cluster if one existed. However, one can quickly convince himself that
such an exhaustive examination of all possible positively correlated
additions is, in general, completely impractical because of the
magnitude of the task. What is needed is some way of determining which
of the positively correlated documents should be added to S on each
iteration.
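
The size of such a tree can be seen from a short enumeration. This is a sketch under stated assumptions, not part of the selected procedure. It assumes that C(x S) is the sum of the link values from x into S, and `all_addition_paths` is a hypothetical helper that walks every branch of the tree.

```python
from itertools import combinations

def corr(x, S, links, default=-6):
    """Assumed correlation of x to S: the sum of its link values into S."""
    return sum(links.get(frozenset((x, s)), default) for s in S)

def all_addition_paths(S, docs, links, default=-6):
    """Yield every root-to-leaf path of the tree: each path is one maximal
    sequence of positively correlated additions starting from S."""
    candidates = [x for x in sorted(docs)
                  if x not in S and corr(x, S, links, default) > 0]
    if not candidates:
        yield []
        return
    for x in candidates:
        for rest in all_addition_paths(S | {x}, docs, links, default):
            yield [x] + rest

# Hypothetical network: y1 and three other documents, all linked at +5.
docs = {"y1", "x1", "x2", "x3"}
links = {frozenset(pair): 5 for pair in combinations(sorted(docs), 2)}
paths = list(all_addition_paths(frozenset({"y1"}), docs, links))
```

With three candidate documents that are all mutually positive there are already 3! = 6 paths. In a network of this kind the count grows factorially with the number of candidates, which is the magnitude the text refers to.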

There will, of course, be cases where the answer cluster is
obtained no matter which of the positively correlated documents is added
to S on a given iteration. A simple example of a request and network
for which this is the case is given in Fig. 5.12. On the first
iteration one can add either x1 or x2 and still end up with the answer
cluster (y1y2x1x2).

Links  shown  are 

Fig. 5.12. Network where it does not matter which document
is  added  to  S  first. 

However, in the more general case the choice of which document to
add to S on each iteration is a very critical aspect of the clustering
procedure. The answer to a request may not even be found if the wrong
document is added to S on one or more of the iterations. As an example,
consider the network and request of Fig. 5.7. If the procedure were to
add the wrong document to S on the first iteration, then (y^x^x^), the
only cluster which satisfies the request, would not be found.

Let us now describe the criteria used by the procedure of Sec. 5.3
to decide which document to add to S on each iteration and note how
these criteria might help in obtaining an answer cluster if one exists.

In Steps 9-11 of Fig. 5.5 preference is given to documents that are
positively linked to each y_i (or else leave the y_i positive to S) and
negatively linked to each z_i (or else leave the z_i negative to S). The
network of Fig. 5.7 serves as an example of how this preference might
aid in obtaining the answer cluster. Documents  and x^ are considered
for addition to S before  and x^ and the answer cluster (y^x^) is
obtained.

Steps 12 and 15 of Fig. 5.5 are for the purpose of minimizing the
bias  on  each  iteration  and  will  be  discussed  when  we  talk  about  request 
modification  and  ambiguity. 

In Step 14 the document which is selected for addition to S is the
one that has the highest positive correlation to S from among those
documents that have met all of the earlier criteria.
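
The selection rule of Step 14 amounts to taking a maximum over the surviving candidates. The sketch below is illustrative only. It assumes that C(x S) is the sum of the link values from x into S and it leaves the bias out. The name `select_for_addition` and the link values are hypothetical.

```python
def corr(x, S, links, default=-5):
    """Assumed correlation of x to S: the sum of its link values into S."""
    return sum(links.get(frozenset((x, s)), default) for s in S)

def select_for_addition(candidates, S, links, default=-5):
    """Return the candidate with the highest positive correlation to S,
    or None when no candidate is positively correlated to S."""
    best, best_value = None, 0
    for x in sorted(candidates):          # sorted only to make ties repeatable
        value = corr(x, S, links, default)
        if value > best_value:
            best, best_value = x, value
    return best

# Hypothetical request Y=(y1 y2); links not listed are -5.
S = {"y1", "y2"}
links = {
    frozenset(("y1", "x1")): 4,
    frozenset(("y2", "x1")): 4,
    frozenset(("y1", "x2")): 4,
}
choice = select_for_addition({"x1", "x2", "x3"}, S, links)
no_choice = select_for_addition({"x2", "x3"}, S, links)
```

Here x1 is linked at +4 to both request documents and is chosen. x2 is positive to only one of them and is negative to S as a whole, so it is never proposed.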

The theorem at the beginning of this section shows that the only
operation that a procedure needs to perform is the addition of positively
correlated documents to S if the appropriate document to be added on
each iteration can be determined. If, in fact, the procedure mistakenly
adds on a given iteration a document which is not part of the answer,
then it may still be possible to arrive at the answer if the procedure
is allowed to also delete documents that have become negatively
correlated to S (Steps 5-7 of Fig. 5.4). In the network of Fig. 5.13 the
answer S4=(y1y2x1x2) is obtained even though S1=(y1y2x3).

Links shown are +4
Links not shown are -5

Fig. 5.13. Network showing that the procedure must be
allowed to delete as well as add.

Despite the above features which help in the choice of the document
to be added on each iteration, there are still cases where the
procedure of Sec. 5.3 does not find an answer cluster even when one
exists. Consider the request and network of Fig. 5.14. Documents x1,
x2, and x3 are linked to the documents in sets Y and Z by exactly the
same values and are all candidates for addition to S on the first
iteration. If the first document to be added is either x1 or x2, then
the procedure finds the cluster (y1y2x1x2) which is the only valid
answer cluster for the request. If, however, x3 is added to S first,
then the procedure reaches a point where no bias can be chosen which
will simultaneously keep y1 and y2 positive to S and the Z set negative
to S, and the request is judged inconsistent.


Links shown are +4 unless otherwise indicated.
Links not shown are -5.

Only valid answer cluster = (y1y2x1x2)

Fig. 5.14. Network illustrating the difficulties involved
in knowing which document to add to S on a
given iteration.


The alternatives open to the procedure for the network of Fig. 5.14
are shown in the decision tree of Fig. 5.15. It should be pointed out
that all of the procedures discussed in this chapter decide which
document to add to S on each iteration on the basis of the relatedness
of the document being considered to the documents in the S, Y, and Z
sets only. The inter-relatedness of the documents not in S, Y, and Z is
not a factor in the selection. Indeed, from a practical standpoint, it
cannot be used as a factor in the decision, since it would necessitate
considering the consequences of adding subsets of documents instead of
single documents, and for r documents under consideration there are as
many as 2^r subsets to consider.
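
A short enumeration makes the growth concrete. The helper `subsets` below is hypothetical and serves only to count the alternatives that a subset-based decision would have to weigh.

```python
from itertools import combinations

def subsets(candidates):
    """Yield every subset (including the empty one) of the candidates."""
    items = sorted(candidates)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield set(combo)

# Number of subsets to examine for r documents under consideration.
counts = {r: sum(1 for _ in subsets(range(r))) for r in (3, 5, 10)}
```

Each additional document under consideration doubles the number of subsets, whereas the single-document rule examines only r alternatives per iteration.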


Fig. 5.15. Tree illustrating the possible additions to
S for the network and request of Fig. 5.14.


If the documents to be added to S are chosen on the basis of their
relatedness to the S, Y, and Z sets only, then there is no way of
determining whether to add x1, x2, or x3 to S0 in Fig. 5.14. If one
cannot tell beforehand whether to add x1, x2, or x3, then perhaps a
procedure should be devised that would at some later point back up and
try another 'direction' if S becomes inconsistent with the request. In
other words, if x3 is added to S in Fig. 5.14, perhaps one could on the
fourth iteration remove a subset containing x3 from S and add x1 and
x2. Such a step would require not only that the procedure be able to
know which subset to remove but also that it remember all of the
previous S sets so that it would not fall into a non-terminating cycle.
This approach is also rejected as not being practical.

The philosophy adopted for this research project is that for those
cases where the procedure has difficulty in locating an answer, the
user should be coupled into the procedure to guide the process in the
right direction. This is the reason for the interaction points in the
procedure. The user can step in before the addition or deletion of any
document and over-ride the decision of the procedure by changing the
request, if he decides the cluster is moving into the wrong area. In
the case of Fig. 5.14 the user could easily obtain the desired cluster
by specifying any member of the set (xyc^XgXgXy) to be uninteresting.

5.52  Request  Modification 

If  the  request  as  initially  specified  by  the  user  is  inconsistent 
or  ambiguous,  then  some  additional  interplay  may  be  needed  between  the 
system and the user so that it can be appropriately modified. Let us
make some general comments about the suitability of the clustering
procedure for interaction with a user and then deal specifically with
the problem of what particular type of interaction is needed to resolve
request inconsistency and ambiguity.

If  a  clustering  procedure  is  to  be  used  in  close  coupling  with  the 
user,  then  the  process  should  be  divisible  into  small  units  of  effort. 
Each  unit  of  effort  should  produce  some  useful  piece  of  information  that 
can  be  presented  to  the  user  and  the  user  should  be  able  to  make  changes 
to  the  request  between  these  units  of  effort. 

The natural unit of effort is, of course, the iteration. The
information produced by the iteration is the document to be added to or
deleted from S. The change in the request can be the response of the
user to the document presented. An iterative clustering procedure,
therefore, lends itself very well to close supervision by the user.
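
The coupling between iteration and user response can be sketched as a generator that proposes one document per iteration and waits for a reply. This is a simplified illustration, not the flow chart of Sec. 5.3. It handles additions only, assumes that C(x S) is the sum of the link values from x into S, and treats a rejected document simply as excluded. The names `cluster_steps`, `run` and the reply strings are hypothetical.

```python
def corr(x, S, links, default=-6):
    """Assumed correlation of x to S: the sum of its link values into S."""
    return sum(links.get(frozenset((x, s)), default) for s in S)

def cluster_steps(Y, docs, links, default=-6):
    """One iteration per yield: propose a document, then act on the reply."""
    S, rejected = set(Y), set()
    while True:
        candidates = [x for x in sorted(docs)
                      if x not in S and x not in rejected
                      and corr(x, S, links, default) > 0]
        if not candidates:
            return S                      # nothing positive is left to add
        proposal = max(candidates, key=lambda d: corr(d, S, links, default))
        reply = yield proposal            # interaction point
        if reply == "not interesting":
            rejected.add(proposal)        # the user over-rides the procedure
        else:
            S.add(proposal)

def run(Y, docs, links, respond):
    """Drive the generator, passing each proposal to the `respond` callback."""
    steps = cluster_steps(Y, docs, links)
    try:
        proposal = next(steps)
        while True:
            proposal = steps.send(respond(proposal))
    except StopIteration as done:
        return done.value

# Hypothetical network: links not listed are -6.
docs = {"y1", "x1", "x2", "x3"}
links = {frozenset(p): 5 for p in
         [("y1", "x1"), ("y1", "x2"), ("x1", "x2"), ("y1", "x3")]}
unattended = run({"y1"}, docs, links, lambda d: "ok")
guided = run({"y1"}, docs, links,
             lambda d: "not interesting" if d == "x1" else "ok")
```

A user who always answers "ok" lets the sketch run to completion unattended, while a single "not interesting" reply is enough to steer S toward a different set.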

There are four interaction points shown for the procedure of
Sec. 5.3. The initial specification of the request is made at Step 1.
In Step 6, which immediately precedes the deletion of a document from S
(Step 7), the user is given a chance to examine the document to be
deleted and to modify his request if he wishes to. In Step 13 the user
is allowed to ask questions and change the request before the addition
of a document to S. In Step 23 the request is judged inconsistent and
the user is again allowed to obtain information from the system and
modify the request. These four steps provide an interaction point before
each change to S and on each iteration of the procedure. A description
of the full range of questions that can be asked by the user at these
interaction points will be given when the retrieval language is presented
in Chapter VIII.

Let us now consider the problem of determining whether a request is
inconsistent or ambiguous. One test for inconsistency has already been
given. The last theorem of Sec. 4.5 states that in order for two
negatively correlated documents to be in the same cluster they must be
positively linked to at least one common document (if K = -C_max). We
present three more theorems pertaining to whether two documents are
assured of being in a cluster together or not.

Theorem. Two documents x_i and x_j can be positively correlated
to exactly the same documents and negatively correlated to the
same documents and still not be in the same clusters.

Proof. Consider the example of Fig. 5.16. The documents  and x^
are both positively correlated to x^ and x^ and negatively correlated to
x^. However,  forms a cluster which contains x^ and excludes
Xj. The link between x^ and x^ is dotted to show that they can be
positively or negatively linked and the theorem would still be true. QED
Theorem. A document x_i can be positively correlated to every
document that a document x_j is negatively correlated to (and vice
versa) and x_i and x_j can still be in a cluster together.

Proof. The networks in Fig. 5.17 offer a proof of this theorem.
The documents  and x^ are in the same cluster (x^x^x^x^) and yet the
values of their links to x^ and x^ have the opposite signs. QED


If one adds the restriction that K = -C_max, then the above theorem
is only true for positively correlated document pairs. The last theorem
of Sec. 4.5 states that when K = -C_max two negatively correlated
documents can occur in a cluster together only if they are positively
linked to one or more of the same documents.

Theorem. Two documents x_i and x_j are assured of always
being in the same clusters together if C(x_i x_j) is greater than
the absolute magnitude of the difference in the correlations
of x_i and x_j to every possible subset of other documents.

Proof. To prove this theorem let us assume that x_i and x_j are not
in the same cluster and then show a contradiction. Let us say that x_i
forms a cluster with the set of documents A which does not include x_j
as indicated in Fig. 5.18.


Fig. 5.18. Network for proof of theorem.

Since x_i together with A is a cluster:

C(x_i A) > 0

C[(x_j)(A ∪ x_i)] ≤ 0

Rearranging and combining these inequalities:

C(x_j A) + C(x_i x_j) ≤ 0

C(x_i x_j) ≤ -C(x_j A)

C(x_i x_j) ≤ C(x_i A) - C(x_j A)

C(x_i x_j) ≤ |C(x_i A) - C(x_j A)|

This last inequality is in conflict with the part of the theorem
which states that for any A:

C(x_i x_j) > |C(x_i A) - C(x_j A)| ≥ 0
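
The condition of this theorem can be tested directly on a small network, at the cost of examining every subset of the other documents. The sketch below is illustrative only. It assumes that C(x A) is the sum of the link values from x into A, and the name `always_together` and the link values are hypothetical.

```python
from itertools import combinations

def link(a, b, links, default=-5):
    """Link value between two documents; pairs not listed get `default`."""
    return links.get(frozenset((a, b)), default)

def corr(x, A, links, default=-5):
    """Assumed correlation of x to A: the sum of its link values into A."""
    return sum(link(x, a, links, default) for a in A)

def always_together(xi, xj, docs, links, default=-5):
    """True if C(xi xj) > |C(xi A) - C(xj A)| for every subset A of the
    other documents. The empty subset enforces C(xi xj) > 0."""
    others = sorted(d for d in docs if d not in (xi, xj))
    c_ij = link(xi, xj, links, default)
    for k in range(len(others) + 1):
        for A in combinations(others, k):
            gap = abs(corr(xi, A, links, default) - corr(xj, A, links, default))
            if c_ij <= gap:
                return False
    return True

# Hypothetical network of four documents; links not listed are -5.
docs = {"a", "b", "c", "d"}
strong = {
    frozenset(("a", "b")): 10,
    frozenset(("a", "c")): 4,
    frozenset(("b", "c")): 4,
    frozenset(("b", "d")): 4,
}
weak = dict(strong)
weak[frozenset(("a", "b"))] = 4
```

With the +10 link, a and b differ by at most 9 in their correlation to any subset, so the condition holds. With the link lowered to +4 the subset (d) already violates it. The exhaustive loop over subsets also illustrates why the test has limited practical use.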

These three theorems give some indication of the difficulties
involved in determining if two documents are in the same cluster on the
basis of the links from those documents to the other documents of the
network. The third theorem here and the last theorem of Sec. 4.5 would
help in some cases to determine whether documents can co-occur in
clusters, but they have far from general applicability.

It was, therefore, concluded that there was no easy test which
could be initially performed to determine if the request was
inconsistent or ambiguous. The tests which were devised consisted of
attempts to find one or more clusters which satisfied the request and
required at least as much effort as the finding of an answer for a
valid request.

It was decided that the procedure should not concern itself with the
problems of request ambiguity and consistency at first but should assume
that the request is valid and start trying to find the answer cluster.
If during this process it was decided that the request was inconsistent,
then the user would be notified of this fact. And if the user was still
worried about ambiguity after a cluster had been found, then he could
perform some further searching to satisfy himself that he had retrieved
what he was after.

It was further decided that the user should be given the option of
being able to interact with the procedure on any or all of the
iterations in order to monitor what was being retrieved and in order to
modify the request if the situation demanded it. Thus a user who
suspected his request to be ambiguous or inconsistent could carefully
watch what documents were being added to S to make sure that he was
obtaining what he wanted, while the user who had confidence in the
validity of his request could let the procedure run to completion
unattended.

The rule which was followed in the design of the procedure of
Sec. 5.3 was, therefore, to allow the user to interact at any point he
wished to (and especially in cases where an invalid request was
suspected), but never to require that he respond before the clustering
could continue. Thus in Steps 23 and 24 of Fig. 5.6 the request appears
to be inconsistent. The user is given the chance of changing his
request if he wishes. If no change is made, then the procedure picks a
document to be deleted from Z so that clustering can continue.




Also  in  the  case  of  ambiguity  the  procedure  is  designed  to  find  the 
most  reasonable  answer  cluster  it  can  for  presentation  and  not  to  depend 
on the user to clear up the ambiguity. This is the purpose of Steps 12
and 15 in Fig. 5.5. If two clusters with different biases are both
valid  answers  to  the  request,  then  the  one  with  the  smaller  bias  is 
considered  a  better  selection.  Therefore,  an  attempt  is  made  to  make 
the  bias  as  small  as  possible  on  each  iteration. 

5.53 Convergence

A major objective in the design of the clustering procedure is to
insure that it will always terminate in a finite number of steps for
every possible document network and every possible request. A procedure
which occasionally drops into an infinite loop would, of course, be
completely unacceptable. The possibility of an infinite loop comes
about because the procedure can delete as well as add documents to the
set S. If on some iteration the set S has the same composition as it
had on a previous iteration, and if the procedure does not remember all
of the previous S sets, then a non-terminating cyclic behavior is
possible.
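
The remark about remembering previous S sets can be illustrated by a small driver that records every composition of S it has seen. This is a hypothetical sketch, not part of the selected procedure (which avoids the need for such a memory). `step` stands for any rule that maps the current S to the next one.

```python
def run_with_memory(step, S0, max_iterations=1000):
    """Apply `step` repeatedly, remembering every composition of S.
    Returns (S, cycled): cycled is True if a previous S set recurs."""
    S = frozenset(S0)
    seen = {S}
    for _ in range(max_iterations):
        nxt = frozenset(step(S))
        if nxt == S:
            return S, False        # no change: the procedure has terminated
        if nxt in seen:
            return nxt, True       # an earlier composition has come back
        seen.add(nxt)
        S = nxt
    raise RuntimeError("iteration limit reached")

# Hypothetical rule that alternately adds and deletes the same document.
def toggle_x3(S):
    return S ^ {"x3"}

final, cycled = run_with_memory(toggle_x3, {"y1", "x1", "x2"})
stable, stable_cycled = run_with_memory(lambda S: S, {"y1"})
```

Without the `seen` set the toggling rule would alternate between the two compositions indefinitely. The memory detects the repeat on the second iteration, at the price of storing up to 2^n sets.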

In  Phase  I  of  the  procedure  convergence  is  assured  by  the  following 
theorem. 

Theorem. A procedure is convergent if the only types of
changes made to the set S being formed are the addition of
documents positively correlated to S and the deletion of
documents negatively correlated to S.



Proof. The internal correlation of S is increased by the addition
of a document positive to S. It is also increased by the deletion of a
document negative to S. Thus C(S) increases monotonically as these two
types of changes are made to S. This means that C(S) is larger on a
given iteration than for any earlier iteration. Therefore the
composition of S must be different on each iteration. Since there are
at most 2^n possible S sets (for a network of n documents), there are
at most 2^n iterations of the procedure before it terminates. QED
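
The monotonic growth of C(S) can be observed on a small example. The following is a minimal sketch under stated assumptions, not the Phase I flow chart: the internal correlation C(S) is taken to be the sum of the link values over all pairs in S, the bias is held fixed, and the names `phase_one`, `internal` and the link values are hypothetical.

```python
from itertools import combinations

def link(a, b, links, default=-6):
    """Link value between two documents; pairs not listed get `default`."""
    return links.get(frozenset((a, b)), default)

def corr(x, S, links, default=-6):
    """Assumed correlation of x to the other members of S."""
    return sum(link(x, s, links, default) for s in S if s != x)

def internal(S, links, default=-6):
    """Assumed internal correlation C(S): sum of links over pairs in S."""
    return sum(link(a, b, links, default)
               for a, b in combinations(sorted(S), 2))

def phase_one(Y, docs, links, default=-6):
    """Delete the most negative member, else add the most positive
    outsider, until neither change is possible. Records C(S) each time."""
    S = set(Y)
    history = [internal(S, links, default)]
    while True:
        negative = [x for x in sorted(S) if corr(x, S, links, default) < 0]
        positive = [x for x in sorted(docs)
                    if x not in S and corr(x, S, links, default) > 0]
        if negative:
            S.remove(min(negative, key=lambda d: corr(d, S, links, default)))
        elif positive:
            S.add(max(positive, key=lambda d: corr(d, S, links, default)))
        else:
            return S, history
        history.append(internal(S, links, default))

# Hypothetical network in which x3 is added first and deleted later.
docs = {"y1", "x1", "x2", "x3"}
links = {
    frozenset(("y1", "x1")): 8,
    frozenset(("y1", "x2")): 8,
    frozenset(("x1", "x2")): 5,
    frozenset(("y1", "x3")): 9,
}
S, history = phase_one({"y1"}, docs, links)
```

In this run x3 is added on the first iteration and deleted on the fourth, yet the recorded values of C(S) never decrease, so no composition of S can recur.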

If the bias of the network is changed as it is in Phase II, then
the above theorem no longer insures convergence. For example, the
following steps might possibly be taken by a hypothetical procedure in
trying to obtain a cluster in the network of Fig. 5.19.

Links not shown are -6

Fig. 5.19. Network which may cause a procedure to cycle.
(4)  Bias = -2 to keep z1 negative

(5)  S3=(y1x1x2x3), C(x3S2)=1

(6)  Bias = -3 to keep z1 negative

(7)  S4=(y1x1x2)

(8)  Bias = -2 to just keep z1 negative

At this point the procedure returns to Step (5) in a never ending
loop.
In order to avoid such cycles Phase II of the procedure selected
(Sec. 5.3) synchronizes each change in bias with the addition of a
document  to  S.  If  the  document  being  added  increases  the  internal 
correlation  of  S  by  k  bits,  then  a  decrease  in  bias  is  allowed  which 
decreases  the  internal  correlation  by  up  to  k  bits.  Thus  the  total 
internal  correlation  of  S  is  still  increased  on  each  iteration  and 
convergence  is  again  assured. 

In the above example Phase II would combine (synchronize) Steps (3)
and (4) and allow the bias to still be -2 bits. Steps (5) and (6) would
also be combined but the bias would only be allowed to go to -2.2 bits
(b3=C(x3S)/5). Step (7) would not be taken because x3 would not be
negative [C(x3S)=0.6].

Thus  far  we  have  talked  about  the  effect  of  decreasing  the  bias 
on  convergence.  An  increase  in  bias  does  not  reduce  the  total  internal 
correlation  and  would  not  necessarily  have  to  be  synchronized  with 
additions  to  the  set.  For  purposes  of  symmetry,  however,  bias  increases 
are  placed  under  the  same  restrictions  that  bias  decreases  are. 

Finally,  let  us  consider  convergence  in  Phase  III.  Bias  changes 
that  are  not  synchronized  with  the  addition  of  a  document  are  now 
allowed,  but  the  bias  can  change  in  only  one  direction.  We  have  already 
shown  that  the  clustering  procedure  is  limited  to  a  finite  number  of 
iterations  for  a  given  bias  (by  the  above  theorem).  Phase  III  permits 
only  a  finite  number  of  bias  changes  so  the  total  number  of  iterations 
is  finite  and  we  are  assured  of  convergence  once  more. 

5.54 Minimum Number of Iterations

Those steps which are taken to improve the proper selection of the
document to be added on each iteration should also help to decrease the
number of deletions necessary on later iterations. We have already
discussed the problem of choosing the correct document on a given
iteration.
PART THREE: EXPERIMENTAL SYSTEM

In the last three chapters the basic components
of the theoretical model were presented. The next
three chapters describe the experimental system which
was developed so that the ideas and concepts of the
model could be tested in a realistic environment.

The four aspects of the experimental system
that will be covered are:

Chapter VI: Computational Facilities and
Data Base

Chapter VII: File Structure

Chapter VIII: Interaction Language


CHAPTER  VI 
COMPUTATIONAL FACILITIES AND DATA BASE

There are two projects at M.I.T. on which this research endeavor is
highly dependent. Project MAC supplied the computational facilities for
the experimental phase of the project. The Technical Information Project
supplied the document collection and data base on which the experiments
were performed. In addition these two projects provided considerable
other technical and general assistance. Since the computational
facilities and data base are essential components of the experimental
system, they will now be described.

6.1 Computational Facilities

The  experimental  portion  of  this  project  was  designed  for  the 
Project MAC time-sharing system^1. In this section we shall describe
the  MAC  system  and  note  some  of  its  features  that  are  of  particular 
significance  to  this  project.  A  more  complete  description  of  the 
objectives  and  characteristics  of  the  MAC  system  can  be  found  in  the 
references^2’ 2 1 

Fig. 6.1 is an abbreviated diagram of the equipment included in
the MAC system. Some of the more significant parameters of this
equipment are given in Fig. 6.2. All of the equipment shown in Fig. 6.1 is
physically located at M.I.T.'s Technology Square with the exception of
the time-sharing consoles. Over 100 of these consoles are located at
various places on the M.I.T. campus and can be connected to the 7750
through the M.I.T. telephone exchange. There are also MAC consoles at
more remote locations. Indeed any TWX or TELEX telegraph station has
the capability of being connected into the MAC system. Each console
has a dual purpose. It communicates to the 7750 what characters have
been typed on its keyboard and it also types out messages originating
in the 7094 that are routed to it through the 7750.

In  a  time-shared  computer  a  number  of  consoles  can  be  simultaneously 
connected  into  the  system  and  can  independently  obtain  the  services  of 
the  central  processor.  A  limit  is  normally  placed  on  the  number  of 
consoles that can be actively connected at any one time. The purpose of
this limit is to help ensure that those who are connected will be
promptly  serviced.  The  current  limit  for  the  MAC  system  is  30,  but  it 
varies  periodically  as  changes  and  improvements  are  made  in  the  system. 

One of the core storage banks (bank A) contains the time-sharing
supervisory  program.  This  program  decides  which  of  the  users  who 
currently  want  service  has  the  highest  priority.  The  program  of  the 
highest  priority  user  is  loaded  into  core  (bank  B)  from  the  disc  or 
drum  and  allowed  to  run  for  up  to  two  or  three  seconds.  Then  the 
program  is  removed  (swapped)  and  the  new  highest  priority  program  is 
loaded  and  run. 

The IBM 1302 disc is used for permanent or temporary storage of
programs and data. The data file to be described in the next section
is stored on this disc as well as programs which arrange and structure
it and allow the user to communicate with it.

Fig. 6.1. Project MAC Equipment Configuration.

Fig. 6.2. Significant Parameters of MAC System.

6.2 Data Base

The basic data needed to implement the theoretical model of Part
Two  is  a  document  collection  and  a  file  of  partitionings  of  that 
collection.  The  document  collection  selected  is  described  in  the  next 
section  and  the  final  section  of  the  chapter  contains  a  discussion  of 
the  type  of  partitioning  data  that  will  be  used. 

6.21  Document  Collection 

The Technical Information Project at M.I.T. is currently accumulating
a file of information on articles found in the physics periodical
literature. This file covers about 26,000 articles from 25 different
journals. Fig. 6.3 lists the names of the journals and the extent of the
coverage in terms of volumes. The time period covered for each journal
is 1 Jan. 1963 to the present. Note that all of the articles in the
volumes listed are included.

One can gain some appreciation of the extent of the coverage of the
file by noting that the 25 journals account for over 50% of the articles
that are abstracted for Physics Abstracts.

The file is currently growing at the rate of 1500 articles a month.
Periodically new journals are added to the file. Journals to be included
are selected on the basis of a statistical analysis of their citations.
This selection criterion is described more fully elsewhere.

The information extracted for each article is the journal
identification, volume and page number, title, author(s), author location(s),
and coded bibliographic citations. Fig. 6.4 is an example of the
information available on a given article. Fig. 6.5 summarizes some of the
parameters of the file.

                                                    Journal               Number of
     Journal                                        code       Volumes    articles

  1. Annals of Physics                              384        21-36         275
  2. Applied Physics Letters                        646        2-8           592
  3. Canadian Journal of Physics                    [missing]  41-44         531
  4. Helvetica Physica Acta                         [missing]  36-38      [missing]
  5. Indian Journal of Physics                      [missing]  37-39         165
  6. Japanese Journal of Applied Physics            612        2-4           328
  7. JETP Letters                                   821        1-2            65
  8. Journal of Applied Physics                     11         34-37        1643
  9. Journal of Chemical Physics                    12         38-44        3398
 10. Journal of Mathematical Physics                227        6          [missing]
 11. Journal of the Physical Society of Japan       [missing]  13-20         759
 12. Nuovo Cimento                                  17         27-40        1385
 13. Nuclear Physics                                682        46-75        1529
 14. Physica                                        21         29-31         359
 15. Physical Review                                1          129-142      3713
 16. Physical Review (Series B)                     [missing]  [missing]     791
 17. Physical Review Letters                        [missing]  [missing]    1585
 18. [missing]                                      [missing]  [missing]    2880
 19. Physics of Fluids                              799        [missing]     607
 20. Proceedings of the Physical Society (London)   3          81-87         738
 21. Progress of Theoretical Physics (Kyoto)        29         29-34      [missing]
 22. Soviet Journal of Nuclear Physics              825        1          [missing]
 23. Soviet Physics - JETP                          [missing]  [missing]  [missing]
 24. Soviet Physics - Solid State                   [missing]  5-7           814
 25. Soviet Physics - Technical Physics             790        6-10          898

     Total                                                     176        26,471

Fig. 6.3. Journals covered by the physics periodical file
of the Technical Information Project (March 20, 1966).

Fig. 6.4. Example of the information available on a given
article. The last four lines are the coded
citations (J=journal, V=volume, P=page).

Number of articles available on the disc        26,471
(1) Physical Review, Vol. 77-128 (1950-1962)
Average number of articles per track             6.7
Average number of authors per article            2.02
Average number of words per title                3.

Fig. 6.5. Parameters of T.I.P. data file (March 20, 1966).

Initially the information is key-punched on IBM cards. After some
preliminary editing and correction it is then loaded on the IBM 1302 disc
of the Project MAC computer. On the disc it undergoes more editing and
is transformed into the format selected for permanent storage (see
Sec. 7.1).

The T.I.P. file has certain features which make it attractive for
use by this research project. It is of sufficient size and interest to
attract serious users. The articles covered contain a substantial
number of citations, which will be shown shortly to be of particular use.
The generation of the data involves only clerical and mechanical
operators (i.e. no human indexing or evaluation is required).

6.22  Partitions 

Some of the advantages of having a retrieval system based on user
feedback were discussed in Chapter II. A basic objective of this
project was stated to be the investigation of the feasibility of such a
system. In Chapter III a particular form that user feedback could take
was described. Basically it consisted of each interaction of a user
with  the  document  collection  resulting  in  a  partitioning  of  the  docu¬ 
ments  into  a  set  of  interesting  documents  and  a  set  of  uninteresting 
documents. 

This type of interaction was described so that one could better
understand the motivation behind the choice of the sample space,
probabilities, and other aspects of the theoretical model. Actually the
theoretical model as developed in Chapters III, IV, and V in no way
requires that the partitionings on which the probability estimates are
based be generated by user interactions. Any type of partitioning data
could be used, even data that has been arbitrarily contrived. Indeed,
in the experimental system another type of partitioning was used because
usage data is not readily available at the present time.

Let us consider whether a change in the type of partitioning data
employed by the experimental system will impair its effectiveness in
testing whether a system based on usage data is feasible. First it can
be observed that much of this investigation has very little, if any,
dependence on the particular type of data being utilized. For example,
the objective of the procedure of Chapter V is to find a cluster of
documents. Its ability to do this could be examined and tested as well
on a set of arbitrarily selected partitionings of a hypothetical
document collection as on a set of partitionings generated by the
interaction of a real user population with a real library.

There are some reasons, however, why it is advisable to use a set
of partitionings for the experimental system that is not artificial and
which resembles usage data as closely as possible. For example, the
utility of the interaction points in the procedure is best tested by
real users. This, of course, requires a data base which produces
results that a user would be interested in. Also the overall
effectiveness of the system in producing useful results can be properly
evaluated only in a realistic environment.

With this objective in mind let us now consider what types of
partitionings are available for the document collection described in the
last section. There were five types of partitionings that were
evaluated for this project. They consist of dividing the set of
documents into two subsets based on whether or not the documents:

(1) were written by a given author.

(2) contain a certain word in their titles.

(3) cite a given article.

(4) were cited by a given article.

(5) occur in a given subject category.

Thus by criterion (1) there are as many partitions as there are authors
in the file, with each author dividing the document file into those
papers he wrote and those he didn't write.

A  detailed  analysis  of  each  of  the  above  types  of  partitionings  was 
conducted on one volume (vol. 128) of the Physical Review. Certain
tests were also conducted on much larger parts of the document collection.
Let  us  summarize  the  results  of  these  tests  and  evaluate  each  of  the  five 
partitioning  criteria. 

(1) Author Partitions.

Difficulty  was  encountered  in  devising  an  algorithm  that  could 
determine  if  two  author  names  referred  to  the  same  individual.  A  sur¬ 
prisingly  large  number  of  the  authors  were  not  consistent  in  the  way 
they  gave  their  names.  Given  names  were  sometimes  supplied  in  full, 
sometimes  represented  by  an  initial,  and  sometimes  left  off  altogether. 

The  method  which  yielded  the  best  results  required  an  exact  match  of  the 
surname  and  required  that  given  names  either  match  exactly  or  match  on 
the  first  letter  if  one  of  the  names  was  a  single  letter  (i.e.  an  initial). 
We at first allowed a missing given name to be a match for anything, but
this  produced  too  many  false  matches.  We,  therefore,  required  that  in 
order  for  a  match  to  occur  the  number  of  given  names  had  to  coincide. 
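
The matching rule just described can be sketched in a few lines. This is only an illustration under stated assumptions: the representation of a name as a (surname, list of given names) pair, the function name same_author, and the prior normalization of names to upper case without periods are all hypothetical and are not taken from the experimental system.

```python
def same_author(name_a, name_b):
    """Decide whether two author records are treated as the same person.

    Each record is a hypothetical (surname, given_names) pair, where
    given_names is a list of non-empty strings such as ["JOHN", "A"].
    """
    surname_a, given_a = name_a
    surname_b, given_b = name_b
    # Surnames must match exactly.
    if surname_a != surname_b:
        return False
    # The number of given names must coincide.
    if len(given_a) != len(given_b):
        return False
    for g_a, g_b in zip(given_a, given_b):
        if len(g_a) == 1 or len(g_b) == 1:
            # One side is an initial: compare first letters only.
            if g_a[0] != g_b[0]:
                return False
        elif g_a != g_b:
            # Both are full names: they must match exactly.
            return False
    return True
```

Under this sketch "SMITH, JOHN A" and "SMITH, J ALAN" are treated as one author, while "SMITH, JOHN" and "SMITH, JOHN A" are not, since the number of given names differs.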

Another difficulty was that roughly half of the authors were the
authors of only one paper. This produced a large number of partitionings
with only one document in the subset of "interest", with the consequence
that many of the papers did not co-occur with any other paper by this
method.

A  third  drawback  to  this  type  of  partitioning  arises  in  those  cases 
where  an  author  changes  his  area  of  interest  and  publishes  articles  on 
unrelated  subjects. 

(2) Word Partitions.

If every title word is allowed to create a partition of the file,
then practically every document will co-occur with every other document
because of the common function words like "of", "the", etc. The
alternative is to try to identify function words and exclude them from use.
However, there is no clear distinction between function words and keywords.
It is fairly clear that certain words should be eliminated if co-occurrences
are to be meaningful. However, there is a large grey area of words such
as "effect", "wave", "theory", or "electronic" that in and of themselves
create little meaningful linkage, but in combination with other words
are very significant. The approach adopted for the tests was to
eliminate all words that occurred in over 5-10% of the titles. This
unfortunately eliminated the word "nuclear" while allowing words like
"between" and "theory" to create partitions.

A second problem in using word partitions is that there are a
number of words which differ from each other by only a suffix (e.g.
superconductor, superconductors, superconducting, superconductive,
superconductivity). A table was compiled of 40 of the more commonly
occurring suffixes of the title words in the document file. All of the
words which differed from each other by one of these suffixes were
considered equivalent in creating partitionings.

An even more basic problem involves the use of synonymous words for
the same concept. Some type of thesaurus would be necessary to link up
articles with synonymous title words. It was decided that there are too
many problems involved in the generation (or selection) and use of a
thesaurus to warrant any effort in this direction in this research
endeavor.
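
The two devices used in the word-partition tests, a frequency cutoff and suffix equivalence, might be sketched as follows. The suffix list below is a hypothetical stand-in (the table of 40 suffixes is not reproduced in this chapter), the simple longest-suffix stripping only approximates the equivalence rule described above, and the names stem and word_partitions and the default cutoff of 10% are assumptions made for the illustration.

```python
from collections import defaultdict

# Hypothetical suffix table; the actual table of 40 suffixes is not given here.
SUFFIXES = ["ivity", "ive", "ing", "ors", "or", "s"]


def stem(word, suffixes=SUFFIXES, min_stem=4):
    """Strip the longest listed suffix, keeping at least min_stem letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def word_partitions(titles, max_fraction=0.10):
    """Map each retained stem to the set of documents whose title contains it.

    titles: dict of document id -> title string.
    Stems occurring in more than max_fraction of the titles are dropped.
    """
    docs_by_stem = defaultdict(set)
    for doc_id, title in titles.items():
        for word in title.lower().split():
            docs_by_stem[stem(word)].add(doc_id)
    limit = max_fraction * len(titles)
    return {s: docs for s, docs in docs_by_stem.items() if len(docs) <= limit}
```

With this suffix list the five forms of "superconductor" quoted above all reduce to the same stem. A purely frequency-based cutoff of this kind cannot tell a common keyword from a function word, which is the weakness noted in the text.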

(3)  Cite-same  Partitions. 

When two papers cite one or more of the same papers they are said to
be bibliographically coupled. A number of studies have been conducted
to analyze the characteristics of bibliographic coupling. These
studies indicate that bibliographic coupling constitutes a very
meaningful and important type of relationship between papers, especially in
those document collections which have a sizable amount of citation
information. In the T.I.P. file of Sec. 6.21 there are an average of 12
citations per article and strict editorial policies make it easy to
identify the articles that are cited.
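
As an illustration of how cite-same partitions arise from citation data, the sketch below groups documents by the articles they cite and then counts, for each pair of documents, the number of partitions in which both appear. The function names and the dictionary representation are assumptions; the measure of relatedness built on such counts in Part Two is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def cite_same_partitions(citations):
    """Build one partition per cited article.

    citations: dict of document id -> iterable of cited article ids.
    Returns dict of cited article id -> set of documents citing it
    (the subset "of interest" for that partition).
    """
    citing = defaultdict(set)
    for doc_id, cited in citations.items():
        for cited_id in cited:
            citing[cited_id].add(doc_id)
    return dict(citing)


def co_occurrence_counts(partitions):
    """Count, for each unordered document pair, the partitions containing both."""
    counts = defaultdict(int)
    for docs in partitions.values():
        for a, b in combinations(sorted(docs), 2):
            counts[(a, b)] += 1
    return dict(counts)
```

For example, if documents d2 and d3 both cite articles r2 and r3, the pair (d2, d3) co-occurs in two partitions and is bibliographically coupled with a strength of two.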

(4) Cited-by-same Partitions.

We note from Fig. 6.3 that the documents covered by the T.I.P. file
have all been written in the last three years. Due to the time required
to review and publish articles there is usually a period of at least six
months between the time an article is published and the time citations
to it begin to appear in the literature. And even after a span of two
to three years over half of the articles in the Physical Review have
still not been cited by subsequent articles in the Physical Review.

Thus this type of partitioning will have a very small yield for the
current T.I.P. file in terms of the number of documents that will occur
in one or more subsets of interest and in terms of the total number of
co-occurrences of articles that will be generated.

(5) Subject Category Partitions.

A subject index is published of the articles in the Physical Review.
Each article is assigned to from one to four categories. These category
groupings form another type of file partitioning. However, not all of
the 25 journals have subject indexes and there is no general agreement
on category headings among the indexes that do exist. Also the categories
even within a single journal are constantly changing.

In the beginning we decided to use all five of the above types of
partitionings for the experimental system with the hope that each would
add meaningful links to the resulting document network. However, the
results of the above tests led us to conclude that the use of criterion
(3) only would result in an adequate set of partitionings, and would
avoid some of the problems encountered in using the other criteria. The
final experimental system is, therefore, based on partitionings of type
(3) only.

CHAPTER VII
FILE STRUCTURE

Thus  far  we  have  described  the  computational  facility  on  which  the 
experimental  system  operates  and  the  data  it  uses.  Let  us  now  turn  our 
attention  to  the  problem  of  how  the  data  should  be  arranged  and  structured 
for  storage  on  the  disc  or  in  core.  The  first  section  of  this  chapter 
describes  the  general  approach  adopted  in  this  project  for  the  storage  of 
data. Then four basic types of files are suggested and various combinations
of the basic types are proposed for the overall data storage
system of the project. Certain arguments favoring the overall storage
system that was selected are set forth. In the last section a brief
discussion  is  presented  of  the  type  of  data  structure  that  would  be 
appropriate  for  the  data  that  has  been  loaded  into  the  high  speed  core 
storage  for  processing. 

7.1  Description  and  Arrangement  of  Data 

A  few  rather  general  comments  on  the  problem  of  data  storage  are  in 
order  before  we  launch  into  a  description  of  the  particular  types  of 
files  considered  for  this  project. 

It will be useful in our discussion to think of the data to be stored
as forming a tree-like structure. For example, the information file
generated by the Technical Information Project (Sec. 6.21) can be
subdivided into journals. Each of the journals can be broken down into a
number of volumes. Each volume in turn consists of some articles.
Within an article there are several information types: title, author(s),
etc. Some of these information types may be further subdivided. For
example, one can split the author information into the separate authors
of the article. Fig. 7.1 portrays this tree structure.

[Figure: tree whose levels, from the root down, are the data file,
journal nodes, volume nodes, article nodes, information types, and
separate authors.]

Fig. 7.1. Example of tree-like structure of data.

Each  terminal  node  at  the  bottom  of  this  tree  represents  a  piece  of 
data  which  must  be  stored,  such  as  an  author's  name  or  a  citation.  Each 
parent  node  represents  the  grouping  together  of  one  or  more  pieces  of 
logically  related  data.  For  example,  a  volume  node  groups  together  all 
the  articles  which  are  contained  in  that  volume. 

Let  us  first  consider  a  couple  of  problems  involved  in  storing  the 
data  represented  by  the  terminal  nodes.  Much  of  this  data  is  variable 
in  length.  For  example,  titles  might  vary  from  20-200  characters.  Two 
ways of handling variable size data suggest themselves. One might use a
special code or flag to indicate the end of the piece of data or one
might explicitly store the length somewhere in the file. The latter
approach was selected since one would always have to perform a search to
determine  the  end  of  the  data  if  a  flag  were  used. 
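
The advantage of storing lengths explicitly can be seen in a small sketch. Here each field is preceded by a one-byte length, so a program can hop from field to field without examining the characters in between. The one-byte length (adequate for titles of 20-200 characters), the ASCII encoding, and the names pack and nth_field are assumptions made for the example and do not describe the actual disc format.

```python
def pack(fields):
    """Store each field as a one-byte length followed by its characters."""
    out = bytearray()
    for text in fields:
        data = text.encode("ascii")
        if len(data) > 255:
            raise ValueError("field too long for a one-byte length")
        out.append(len(data))
        out += data
    return bytes(out)


def nth_field(blob, n):
    """Return field number n (counting from 0) by hopping over stored lengths."""
    pos = 0
    for _ in range(n):
        pos += 1 + blob[pos]  # skip the length byte and the field itself
    length = blob[pos]
    return blob[pos + 1 : pos + 1 + length].decode("ascii")
```

With an end-of-data flag instead, reaching the third field would require scanning every character of the first two.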

In addition to knowing how long a piece of data is we must know its
type or identification. For example, it is not possible, in general, to
determine whether a string of characters is a title or an author without
being explicitly told this fact. If there were one and only one title,
author, citation, etc. for each article, then the information type could
be specified by the relative position or order of the pieces of data.
However, for a given article there may be none or several citations and
one cannot specify the information type implicitly by the order.

Thus,  in  addition  to  storing  the  actual  data  for  each  terminal  node, 
one must give two additional facts: length and type. The storage of
these  two  additional  facts  is  useful  for  the  parent  nodes  in  the  above 
tree  as  well  as  for  the  terminal  nodes.  The  type  of  information  for  a 
given  node  serves  to  identify  that  node  from  all  of  its  sister  nodes 
which  are  under  the  same  parent  node.  The  length  information  delimits 
the  scope  of  the  node.  For  example,  a  volume  node  would  have  for  its 
identification  the  volume  number,  and  for  its  length  either  the  number  of 
articles  in  the  volume  or  the  amount  of  storage  occupied  by  those 
articles.  Thus  one  can  summarize  the  storage  requirements  of  a  data  file 
by  the  following  two  statements.  An  identification  and  length  must  be 
stored  for  every  node  in  the  related  tree  structure.  In  addition  one 
must  store  a  piece  of  literal  data  for  each  terminal  node. 

The  last  question  to  be  discussed  here  relates  to  the  actual 
physical  order  in  which  data  is  to  be  stored.  Let  us  use  the  example  of 
Fig.  7.2  to  describe  the  arrangement  selected.  One  can  flatten  the  tree 
of  Fig.  7.2  out  into  the  linear  array  of  nodes  shown  in  Fig.  7.3  such 
that  no  two  connecting  lines  cross,  and  such  that  each  parent  node  is  to 
the left of its subnodes.

Fig. 7.2. Example used to show physical order given the data.

Fig. 7.3. Linear arrangement of data in Fig. 7.2.

This  is  the  physical  order  in  which  the  data  is  stored  for  this 
project.  For  the  example  of  Fig.  7.3  the  article  identification  and 
length  are  first  (node  D).  This  is  followed  by  the  code  for  title 
information,  the  title  length,  and  the  actual  title  (node  T).  Next  is 
the  code  for  author  information  and  the  length  of  the  author  data 
(node A). Then the information on a particular author is given (node A1).
This  includes  the  author's  identification  (his  position  among  the 
authors  of  the  article),  the  length  of  his  name,  and  his  actual  name. 

The  description  for  the  remaining  nodes  is  similar. 
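
The flattening just described amounts to a prefix-order (parent before children, left to right) walk of the tree in which every node contributes its identification and length, and every terminal node also contributes its literal data. The sketch below illustrates this under assumptions of our own: a node is a (ident, body) pair, a terminal node has a string body, and the length of a parent node is taken to be the number of items occupied by its subnodes. The article contents are invented for the example.

```python
def flatten(node):
    """Flatten a tree into a list in prefix order.

    node is a hypothetical (ident, body) pair. A terminal node has a
    string body and yields [ident, length, data]. A parent node has a
    list of subnodes and yields [ident, length] followed by the
    flattened subnodes, where length counts the items they occupy.
    """
    ident, body = node
    if isinstance(body, str):
        return [ident, len(body), body]
    items = []
    for child in body:
        items += flatten(child)
    return [ident, len(items)] + items


# An article node D with a title node T and an author node A
# holding two separate authors (identified by position 1 and 2).
article = ("D", [("T", "Spin waves"),
                 ("A", [(1, "SMITH J"), (2, "JONES R")])])
linear = flatten(article)
```

The resulting list begins with the article identification and length, followed by the title code, title length and title, then the author code and length, and finally each author in turn, which is the order described for Fig. 7.3.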

It may be of interest to note that the above approach is analogous
to Polish prefix notation. Consider the algebraic expression [A · (B+C)].
Its Polish prefix form, ·[A, +(B,C)], is obtained by flattening the tree
of Fig. 7.4 such that no lines cross. If one equates terminal nodes to
operands and parent nodes to operators, then our storage arrangement is
the Polish prefix form of the data.

Fig. 7.4. Polish prefix notation.

7.2 Types of Files

In  this  section  four  basic  types  of  data  files  are  described.  An 
overall  data  storage  system  might  consist  of  only  one  of  the  file  types 
or  it  might  include  a  combination  of  several  types. 

7.21  Raw  Data  File 

The file of data generated by the Technical Information Project
(Sec. 6.21) will be termed the raw data file. It currently has the
'Polish prefix' structure described above. The precise substructure of
a given article is shown in Fig. 7.5. The relative amount of storage
occupied by each of the types of information is given in the table of
Fig. 7.6.


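Fig. 7.5. Structure of raw data file.

article node (ident. and length) . . .    5 %
title  . . . . . . . . . . . . . . . .   21 %
authors  . . . . . . . . . . . . . . .   14 %
author locations . . . . . . . . . . .   28 %
citations  . . . . . . . . . . . . . .   32 %
                                        100 %

Fig. 7.6. Percent of storage occupied by each information type.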


7.22 Inverted Files

An  inverted  file  is  a  type  of  index  to  the  raw  data  file.  For 
example,  one  might  create  an  inverted  author  file  by  extracting  from 
each  article  the  authors'  names.  These  names  could  be  alphabetized  and 
the  duplicates  deleted.  Such  a  file  would  have  the  structure  shown  in 
Fig. 7.7. In this figure nodes D1 ... Dn are the identifications of the
articles written by author A.
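
The construction of an inverted author file can be sketched as follows. The dictionary representation of the raw data, the function name invert_authors, and the sample names are assumptions made for the illustration; the sketch shows only the logical steps of extracting, alphabetizing, and merging duplicates, not the disc format.

```python
def invert_authors(raw_file):
    """Build an inverted author file from a raw data file.

    raw_file: dict of article identification -> list of author names.
    Returns a list of (author, [article identifications]) entries,
    alphabetized by author with duplicate names merged.
    """
    index = {}
    for article_id, authors in raw_file.items():
        for author in authors:
            entry = index.setdefault(author, [])
            if article_id not in entry:
                entry.append(article_id)
    return [(author, sorted(ids)) for author, ids in sorted(index.items())]


raw = {
    "D1": ["SMITH J", "JONES R"],
    "D2": ["JONES R"],
    "D3": ["ADAMS K", "SMITH J"],
}
inverted = invert_authors(raw)
```

Each entry of the result corresponds to one author node of Fig. 7.7 together with the identifications of that author's articles.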


[Figure: tree whose levels, from the root down, are the inverted author
file, author nodes, and articles.]

Fig. 7.7. Structure of inverted author file.


Inverted files have been created for title words, authors,
locations, and citations. Because of a current lack of storage space,
the inverted files cover only a part of the total raw data file. This
partial coverage was found to be sufficient for experimental purposes,
however.

On the basis of the experience gained with these partially completed
inverted files, it is estimated that inverted files for the full raw data
file will increase storage requirements by the percentages given in
Fig. 7.8.

title word file  . . . . . .  17.7% of raw data file
author file  . . . . . . . .  15.3%  "   "    "    "
location file  . . . . . . .  15.0%  "   "    "    "
citation file  . . . . . . .  47.5%  "   "    "    "

Total  . . . . . . . . . . .  95.5%  "   "    "    "

Fig. 7.8. Storage requirements for inverted files.

There are certain additional steps that can be taken which will
probably reduce the additional storage required to only about 50% of
the raw data file. Thus adding inverted files increases storage
requirements by a factor of 1.5 to 2.0. It is suspected that the amount of
storage needed for file inversion is a relatively standard factor for
most types of information. Certainly the types of information found in
the test file of this project (title words, authors, locations,
citations) varied markedly in their characteristics but still followed
roughly this factor of two increase.

Fig. 7.9 shows that the relative amount of storage required for an
inverted author file decreases as the size of the file increases. The
leveling off shown leads one to believe that an order of magnitude
increase in the test file would not significantly change the percent
increase in storage required for an inverted author file. A similar
leveling off was found for title words.

[Figure: inverted author file size (based on percent of raw data file
size) plotted against the number of years of Physical Review in the
stack.]

Fig. 7.9. Storage required for inverted author file.
(For articles in Physical Review 1959-64)


There  is  a  good  theoretical  reason  why  the  inverted  files  should 
require about the same amount of storage as the raw data itself. The
reason is that the inverted files store the same information as the raw
data  file  (except  perhaps  for  the  relative  order  of  some  of  the  data). 
Indeed  one  could  reconstruct  the  raw  data  file  from  the  inverted  files 
by  merely  collecting  together  the  title  words,  authors,  etc.  for  each 
article.  The  one  exception  to  the  equivalence  of  the  information  found 
in  the  two  types  of  files  concerns  order.  One  cannot  determine  from  the 
inverted  word  file  the  order  that  the  words  originally  had  in  the  titles 
of  the  raw  data  file,  but  only  which  words  belong  to  each  title.  Of 
course,  some  additional  provision  might  be  made  so  that  inverted  files 
contained  order  information  as  well  as  the  article  identifications. 
However  the  point  here  is  that  the  two  types  of  files  should  require 
about the same amount of storage.

7.23 Linkage Files

A  linkage  file  contains  a  description  of  a  document  network  of  the 
type  described  in  Chapter  III.  The  basic  information  needed  to  describe 
such  a  network  consists  of  document  node  identifications  and  link  values. 

The structure of a linkage file is shown in Fig. 7.10. For each
document node in the network there is an entry in the file which consists
of the identification of the document along with the information on the
links emanating from the node. The linkage information consists of the
identifications of the other document nodes connected to the node in
question along with the values of the connecting links. In such a file
it is necessary to store only those links for which N_ij ≠ 0 with the
understanding that the value of all other links is K.


[Figure: tree whose levels, from the root down, are the linkage file,
document nodes, and linkage node pairs.]

Fig. 7.10. Structure of Linkage File.


Note that the information on each link is specified in two places
in a linkage file. For example, the value of C(x_i, x_j) is stored in the
entry for document x_i and also in the entry for x_j. This redundancy
ensures that once the entry on a given document is located, one
immediately knows all of the documents to which it is linked as well
as the values of the links.
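
A linkage file with this redundant, two-sided storage might be sketched as below. The nested-dictionary representation, the function names, and the sample link values are assumptions made for the illustration; links that are not stored are simply reported with a default value supplied by the caller, standing in for K.

```python
def build_linkage_file(link_values):
    """Store every link under both of the documents it joins.

    link_values: dict of (doc_a, doc_b) -> link value, given once per
    pair and only for links with non-default values.
    Returns dict of document -> dict of linked document -> link value.
    """
    linkage = {}
    for (a, b), value in link_values.items():
        linkage.setdefault(a, {})[b] = value
        linkage.setdefault(b, {})[a] = value
    return linkage


def link_value(linkage, a, b, default):
    """Look up a link; pairs that are not stored take the default value."""
    return linkage.get(a, {}).get(b, default)


linkage = build_linkage_file({("x1", "x2"): 0.8, ("x2", "x3"): 0.3})
```

Once the entry for x2 has been located, all of its links and their values are available without consulting the entries for x1 or x3, at the cost of storing each value twice.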




In an attempt to gain some insight into the size and characteristics
of linkage files, a test was conducted on one volume (Vol. 128) of the
Physical Review. Linkage files were created based on each of the five
types of partitions discussed in Sec. 6.22. The results of this test
are summarized in Fig. 7.11.


Partitioning criterion on        File size (based on        Percent of total
which links are based            size of Phys. Rev.         possible links for
                                 Vol. 128)                  which $N_{ij} \neq 0$

(1) Authors (estimated)          15% of raw data file       1/2%

(2) Title words (for words       58% of raw data file       4%
    occurring less than
    20 times)

(3) Cite-same                    24% of raw data file       1 1/2%

(4) Cited-by-same (citations     4% of raw data file        small
    to v. 128 from
    v. 128-133)

(5) Subject category             175% of raw data file      15%

Fig. 7.11. Table of linkage file sizes for vol. 128 of
the Physical Review.

Fig. 7.11 indicates that partitioning criterion (3) generates a
network in which about 1 1/2% of the links have values other than K
(i.e. $N_{ij} \neq 0$). This is for a single volume of the Physical
Review. It would seem reasonable that this percentage would be somewhat
less for the total document file. We shall assume in the analysis of the
next section that approximately 1% of the possible links in the network
of the total file have non-K values. This means that each document in the
T.I.P. file is linked to about (.01)(26,000) = 260 other documents on the
average.

7.24 Request-Answer File

The  actual  generation  of  this  type  of  file  was  never  seriously 
contemplated  because  of  the  immense  amount  of  processing  time  and  storage 
space  that  would  be  required.  It  is  described  here  because  it  represents 
an  extreme  case  to  which  we  wish  to  make  reference  in  the  next  section. 

A  request-answer  file  contains  the  answer  cluster  for  each  possible 
request.  Its  possible  structure  could  be  represented  by  Fig.  7.12. 

The document nodes in this figure are the documents contained in the
particular answer cluster in question.

[Diagram labels: Request-answer file; Possible request nodes;
Answer cluster nodes; Document nodes]

Fig. 7.12. Structure of request-answer file.

Retrieval  from  this  type  of  file  would  consist  of  a  simple  table 
look-up  for  the  request  and  then  presentation  of  the  associated  answer 
cluster. 

7.3  Storage  Systems 

The  overall  storage  system  selected  for  this  project  could  consist 
of  any  combination  of  one  or  more  of  the  types  of  files  described  in  the 
preceding  section.  For  purposes  of  discussion  and  comparison  let  us 
suggest  four  types  of  storage  systems.  The  first  three  were  implemented 
and  tested  to  some  extent.  System  (2)  is  the  one  that  was  finally 
selected for this project.

(1) Raw data file only.

(2) Raw data file and inverted files.

(3) Raw data file and linkage file.

(4) Raw data file and request-answer file.

The raw data file is included in each of the four storage systems
so that information on specific articles can be presented to the user at
any time he wants it. For instance, a user might want to know the title
and author(s) of an article that is about to be added to the set S.
This information would be obtained from the raw data file.

Each of the four suggested data storage systems could serve as a
base for the clustering procedure of Chapter V. There are some
significant differences in the characteristics of the retrieval system
that would result, however. Let us indicate some of the differences by
discussing four important characteristics of the resulting retrieval
systems.

7.31  Storage  Space  Required 

Since the raw data file is basic to all four systems, we will
express storage requirements in terms of the size of that file. It has
already been noted that the inverted files require about as much storage
as the raw data file. If we make the assumption that 1% of all possible
links have non-K values, as was suggested in Sec. 7.23, then the linkage
file for the TIP document collection would be about six times as large
as the raw data file. If we assume that every request for information
consists of only two documents of interest and every answer cluster
contains 20 documents, then a request-answer file would be about 35
times the size of the raw data file. Much more space would be required
if larger requests were allowed. These figures are summarized in
Fig. 7.13.

(1) Raw data only                      100% of raw data

(2) Raw data plus inverted             200% of raw data

(3) Raw data plus linkage              700% of raw data

(4) Raw data plus request-answer      3500% of raw data

Fig. 7.13. Comparison of storage requirements for the four
types of data systems.


7.32  Processing  Time 

Let  us  next  determine  the  average  amount  of  processing  time  that 
would  be  needed  to  transform  a  request  into  an  answer  cluster  for  each  of 
the  proposed  storage  systems.  By  processing  time  we  mean  the  amount  of 
time  allocated  by  the  central  processor  of  the  Project  MAC  system  to 
running  the  clustering  program.  The  time  spent  in  swapping  the  program 
in and out of core storage is excluded. The ratio of the real time that
the MAC user must wait to the processing time varies with the number and
type of users on the system and can range from one to forty or fifty.

The  time  required  to  access  a  piece  of  data  on  the  1302  disc  is 
about 1/2 second. This includes both the time spent by the disc control
supervisor  and  by  the  disc  in  locating  and  reading  a  track.  Thus  the 
request-answer  system  would  require  about  a  second  in  order  to  find  an 
answer,  since  very  little  computational  or  manipulative  work  is  required. 

For a linkage file system at least 20 accesses to the disc would be
required (for a cluster of 20 documents). This would involve about 10
seconds of processing time in addition to some computational time which
was found to be small in comparison. We pick 15 seconds as the average
amount of time required to find a 20-document cluster if linkage files
are available.

The amount of processing time required to find a 20-document
cluster with an inverted file storage structure has been found to be
50-60 seconds. This includes 60 or so accesses to the disc and a fair
amount of manipulation and computation.

If only the raw data file is available, then one must pass through
the total data file two or three times looking for documents that are
linked to the documents in sets Y, Z, and S. One complete pass through
the raw data file takes 200-300 seconds. Thus the average processing
time would be on the order of 600 seconds. Fig. 7.14 summarizes the
processing time required for each of the four systems.

(1) Raw data only                      600 sec.

(2) Raw data plus inverted              60 sec.

(3) Raw data plus linkage               15 sec.

(4) Raw data plus request-answer         1 sec.

Fig. 7.14. Average processing time required to find a
cluster of 20 documents for the four types
of storage systems.
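
The disc-bound part of these estimates follows directly from the number
of accesses. A rough sketch of that arithmetic in Python, using only the
figures quoted in this section, might read as follows; the function and
variable names are hypothetical, and the computational time that makes up
the remainder of each total is not modelled.

```python
DISC_ACCESS_SEC = 0.5  # quoted access time for one piece of data on the disc


def disc_time(accesses):
    """Seconds spent on disc accesses alone (computation excluded)."""
    return accesses * DISC_ACCESS_SEC


linkage_disc = disc_time(20)    # 20 accesses for a 20-document cluster
inverted_disc = disc_time(60)   # "60 or so" accesses with inverted files

# Raw data only: two or three passes at 200-300 seconds per pass.
raw_low = 2 * 200
raw_high = 3 * 300
```

The disc accounts for about 10 of the 15 seconds chosen for the linkage
system and for about half of the 50-60 seconds observed with inverted
files; the 600-second figure for the raw data file lies within the range
bounded by `raw_low` and `raw_high`.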

7.33  Updating  and  Editing 

Besides the processing time involved in answering requests there is
a certain amount of time required for updating and editing the file,
since it is constantly changing. For purposes of comparison let us
consider the problem of adding 335 articles (50 tracks of raw data) to
an existing file of 20,000 articles (3000 tracks). The time required to
load and structure the raw data file will not be considered since it is
common to all four storage systems.

In order to update the inverted files one must extract the
appropriate fields from the new raw data, sort them into the desired
sequence, and merge the sorted data with the old inverted files. The
current programs for doing this would take about 400 seconds for the 50
tracks of data. The time needed for each information type is as follows:
words - 90 sec., authors - 50 sec., citations - 210 sec., locations -
50 sec. The time for each process is as follows: extraction - ?5 sec.,
sorting - 150 sec., merging - 230 sec.
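
The extract, sort, and merge steps can be sketched as follows. This is an
illustration in Python rather than the programs actually used; the record
layout (a dictionary with an `id` and a list per field) and the function
names are assumptions. An inverted file is represented here as a sorted
list of (key, article identification) pairs.

```python
import heapq


def extract(articles, field):
    """Pull (key, article id) pairs for one field out of the new raw data."""
    return [(key, art["id"]) for art in articles for key in art[field]]


def update_inverted(old_sorted, new_articles, field):
    """Sort the extracted pairs and merge them with the old inverted file."""
    new_sorted = sorted(extract(new_articles, field))
    # heapq.merge makes one sequential pass over two already-sorted inputs.
    return list(heapq.merge(old_sorted, new_sorted))


# Hypothetical old inverted word file and one new article.
old_words = [("laser", 1), ("probe", 2)]
new_articles = [{"id": 3, "words": ["probe", "langmuir"]}]
updated = update_inverted(old_words, new_articles, "words")
```

Because both inputs to the merge are already in sequence, the old file
need only be read once from beginning to end, which is consistent with
merging being the largest single component of the update time.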

Consider the problem of updating a linkage file with the links based
on whether or not two papers cite the same paper (partition type (3) in
Sec. 6.22). Updating can be accomplished by the following steps. First,
extract the citations from the 50 tracks of new articles. Sort these
citations and compare them with the total raw data file to determine
which articles are linked to each new article. During this comparison
process generate a file of information on the new links. Sort this file
and merge it into the old linkage file. The programs which were written
to perform this updating process were only tested on small files of
several hundred articles. Let us extrapolate the results and estimate
how long it would take to update the linkage file for the case under
consideration. Extracting and sorting the citations of the 335 new
articles would take about 100 seconds. Matching the citations with the
total raw data file would take about 1800 seconds and merging them into
the old linkage file would require about 1200 seconds for a total of
4000 seconds.
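
The updating steps can be illustrated by the following sketch. It is not
the program that was tested: it replaces the sort-and-compare passes over
the disc with an in-memory index from each cited paper to the articles
citing it, it stores the count of shared citations rather than the link
value derived from that count, and all names are hypothetical.

```python
from collections import defaultdict


def new_links(old_articles, new_articles):
    """Count shared citations between each new article and every other one.

    Both arguments map an article id to the list of papers it cites.
    """
    cited_by = defaultdict(set)  # cited paper -> ids of articles citing it
    for art_id, cites in {**old_articles, **new_articles}.items():
        for cited in cites:
            cited_by[cited].add(art_id)

    counts = defaultdict(int)    # (new id, other id) -> shared citations
    for new_id, cites in new_articles.items():
        for cited in set(cites):
            for other in cited_by[cited]:
                if other != new_id:
                    counts[(new_id, other)] += 1
    return dict(counts)


def merge_links(linkage, counts):
    """Merge the new links into the linkage file, under both end nodes."""
    for (a, b), n in counts.items():
        linkage.setdefault(a, {})[b] = n
        linkage.setdefault(b, {})[a] = n
    return linkage


old = {1: ["a", "b"], 2: ["c"]}
new = {3: ["b", "c", "a"]}
counts = new_links(old, new)
linkage = merge_links({}, counts)
```

Every new article must be matched against the whole collection, which is
why the matching step dominates the estimate above.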

The amount of time required to update a request-answer file would
be more of a guess than an estimate. It would take at least 7000
seconds to rewrite the file and probably 10 to 100 times more to find
all the clusters. These figures are tabulated in Fig. 7.15 for ease of
comparison.

(1) Raw data only                         0 sec.

(2) Raw data plus inverted              400 sec.

(3) Raw data plus linkage              4000 sec.

(4) Raw data plus request-answer       7000+ sec.

Fig. 7.15. Processing time required to update a file of 20,000
articles with 335 new articles for each of the
four storage systems.

7.34 Flexibility and Compatibility

So far we have been mainly concerned with how much storage space
and processing time is required for a system which finds answer
clusters. Actually the process of finding clusters as proposed in this
thesis is not considered to be the only retrieval tool which will be
made available to the user. Rather, clustering is looked upon as one
possible component in a larger, more general retrieval system. It
follows that the storage structure of the data should not be designed
with just the clustering process in mind, but should be chosen on the
basis of its utility and adaptability to a large class of retrieval
functions.

Even if the data file for the experimental system were to be used
exclusively for clustering, it would still be useful to make the
structure selected as general as possible. One reason why this is so
stems from the fact that any experimental system is generally in a
constant state of flux and any rigid or specialized data structure may
soon be rendered obsolete.

Let us suggest that the following objective might yield a data
storage structure which would provide an adequate base for a large
number of different retrieval functions and at the same time strike a
suitable compromise between storage and time requirements.

"The amount of storage required should be minimized
subject to the restriction that at no time should one have to
serially search through the total file to obtain a given
piece of information. By serial search we mean a sequential
examination of every article in the file."

7.4 Selection of Storage System

From Sec. 7.31 and 7.32 it is evident that no data structure will
at the same time minimize the processing time and storage space
required. Some type of engineering compromise is needed. This compromise
must be influenced by such factors as the characteristics of the
computational facilities to be used and by the type of retrieval service
that is to be offered. One must also consider the costs involved in
updating the file and how often updating is to be performed. The
decision is further complicated by the fact that the structure selected
should be compatible with other retrieval functions and flexible to
change.

A storage system consisting of the raw data only requires the least
amount of storage space and the least effort to update. Its major
drawback is in the time required to answer a request. Even now with the
current file of about 26,000 articles the time required to find
information is generally too great to allow for close man-machine
coupling. And if the file size were to increase by an order of
magnitude, a system based on this structure would certainly be too slow.

The linkage and request-answer files have excellent response times
but require an excessively large amount of storage space and are very
hard to update. In addition they are designed specifically for the
purpose of finding clusters and have little or no real value to other
retrieval operations.

The second type of data storage system, consisting of the raw data
file and the inverted files, was the one selected for this project. Its
storage requirements were less than double that required for the raw
data file alone. The processing time required to find a cluster was
high, but not so high as to exclude close man-machine interaction, and
it appears that an order of magnitude increase in the file size would
not appreciably increase these time requirements. Updating of the
system could be done on a daily or weekly basis without consuming an
excessive amount of computational effort. The structure is also useful
in a large number of other retrieval operations, as will become more
obvious in the next chapter.

7.5  High  Speed  Storage  Structure 

So far in this chapter we have discussed how the data should be
structured for permanent storage on the disc. A related problem
concerns the form the data should take once it has been selected for
processing and is loaded into high speed core storage.

The approach that was used in the earlier versions of the
experimental system was to convert the data to a "list" structure as it
was loaded into core. This involves associating one or more address
pointers with each piece of data. The pointers preserve the original
sequence of the data without requiring that it occupy contiguous
locations in memory. One of the major advantages of such a structure is
the relative ease with which the data can be re-arranged and with which
particular pieces of data can be added and deleted. The programming
languages that have been developed to facilitate the creation and
manipulation of list structures include COMIT, LISP, and SLIP.

It was later decided that the added flexibility obtained through
the use of list structures was not, in general, needed for library-type
data that remains relatively fixed. Indeed the processing time required
to reformat the data into lists was considerable. Therefore the approach
that was finally adopted was to leave the data in core in the same form
that it was on the disc.

It is actually easier to perform some of the operations needed in
the formation of a cluster on this disc structure than it is to do them
on the equivalent list structure. Take, for example, the calculation of
the $N_{ij}$'s. For the partitioning criterion selected this would
involve the comparison of two tables of citations. The most efficient
way that has been found to do this is to have the citation codes of each
article in numeric order on the disc, and to make a single synchronous
pass through the two tables tallying the number of matching entries. The
time required to do this match if the data has a list structure would
probably at least double. There are also certain other operations (e.g.
binary or logarithmic searches) for which a list structure is not well
suited.
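
The synchronous pass can be sketched as follows. This is an illustration
in Python, not the code of the experimental system; the function name and
the sample citation codes are hypothetical, and each table is assumed to
be in ascending numeric order with no code repeated.

```python
def count_matches(table_a, table_b):
    """Tally the entries common to two citation tables in numeric order."""
    i = j = matches = 0
    while i < len(table_a) and j < len(table_b):
        if table_a[i] == table_b[j]:
            matches += 1          # same citation code in both tables
            i += 1
            j += 1
        elif table_a[i] < table_b[j]:
            i += 1                # advance whichever table is behind
        else:
            j += 1
    return matches


# Hypothetical citation codes for two articles.
shared = count_matches([3, 8, 15, 21], [1, 8, 21, 30])
```

Each table is read once from front to back, so the work is proportional
to the combined length of the two tables and requires only sequential
access to contiguous storage.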

For the final version of the experimental system a rather simple
storage allocation system was adopted which kept track of the available
free core storage. Through this system blocks of storage could be
allocated, changed in size, or freed up for other uses. Reference to
each block was through a numeric code so that the actual address of the
block could change. This meant that all the free storage could be
kept in one contiguous block. Data from the disc was loaded into these
blocks of storage and processed there.

The S, Y, and Z document sets were also placed in blocks obtained
from the storage allocator. It was later decided that this was a
distinct disadvantage to the system because the sets were constantly
changing and should have had the flexibility available from a list
structure.
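
The storage allocation scheme described above, in which blocks are
referenced by numeric code so that they may be moved, might be sketched
as follows. This is a simplified illustration in Python; the class
`BlockStore`, its method names, and the choice to compact on every
release are assumptions, and changing the size of a block is omitted.

```python
class BlockStore:
    """Simulated core storage whose blocks are addressed by numeric codes."""

    def __init__(self, capacity):
        self.core = [0] * capacity   # simulated core storage
        self.blocks = {}             # code -> (start address, size)
        self.top = 0                 # free storage is core[top:]
        self.next_code = 1

    def allocate(self, size):
        if self.top + size > len(self.core):
            raise MemoryError("not enough free core storage")
        code = self.next_code
        self.next_code += 1
        self.blocks[code] = (self.top, size)
        self.top += size
        return code

    def free(self, code):
        start, size = self.blocks.pop(code)
        end = start + size
        # Slide everything above the freed block down to close the gap.
        self.core[start:self.top - size] = self.core[end:self.top]
        for other, (s, n) in list(self.blocks.items()):
            if s >= end:
                self.blocks[other] = (s - size, n)
        self.top -= size

    def write(self, code, values):
        start, size = self.blocks[code]
        if len(values) > size:
            raise ValueError("data does not fit in block")
        self.core[start:start + len(values)] = values

    def read(self, code):
        start, size = self.blocks[code]
        return self.core[start:start + size]

    def free_space(self):
        return len(self.core) - self.top


store = BlockStore(10)
a = store.allocate(3)
b = store.allocate(2)
c = store.allocate(4)
store.write(c, [7, 8, 9, 6])
store.free(b)   # block c is moved down, but its code is unchanged
```

Because callers hold only the code, block `c` can be moved when `b` is
released, and the free storage remains one contiguous region at the top.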


CHAPTER VIII
INTERACTION LANGUAGE

The  description  of  the  experimental  system  is  now  almost  complete. 
The  clustering  procedure  which  is  used  in  answering  requests  has  been 
defined  in  Chapter  V.  The  computational  facilities  and  data  base  on 
which  the  system  operates  have  been  described  in  Chapter  VI.  In  Chapter 
VII  the  way  the  data  is  structured  was  explained. 

The one aspect of the experimental system that has not been covered
concerns the interface between the user and the system. In this chapter
we will describe the language which permits the user to communicate and
interact with the system.

8.1  Background  to  Language 

As  a  way  of  introducing  the  language  we  will  present  in  this 
section  some  of  the  general  design  objectives  that  were  selected  for  the 
language  and  an  example  of  a  typical  interaction  using  the  language. 

8.11  Design  Objectives  of  Language 

The first retrieval language developed for this project was
designed specifically for clustering and bore little resemblance to the
language used by the Technical Information Project programs in performing
the more conventional matching functions (author, citation, and keyword
searches, bibliographic coupling, etc.). It was found to be inconvenient
and confusing to have to shift from one program and one language to
another program and another language every time one wanted to shift from
a clustering request to a T.I.P. request and vice versa. It was decided
that the same general language should be used for both functions. This
goal is related to the idea expressed in the last chapter that the
clustering function should be considered a component of a larger
retrieval system (Sec. 7.34). Not only should the data structure be
designed for the larger, more general system, but the retrieval language
should be also. In the remainder of the chapter the clustering and
matching functions will, therefore, be treated equally.

In addition to having adequate expressiveness for the current
clustering and T.I.P. commands, it was considered desirable that the
language be flexible enough so that it might be easily extended to other
types of retrieval operations.

A second objective of the language is that it should be easy to
learn, use, and remember. It was decided that if the vocabulary and
syntax of the language resembled normal English it would be easiest to
learn and remember. However, it was found to be rather tedious after a
while to have to type a complete English sentence for each request. An
abbreviated version of the language was, therefore, developed for the
experienced user which allowed much of the vocabulary to be abbreviated.
The abbreviated version was such that one could make a smooth transition
from the full English request to the abbreviated request as he became
more familiar with the system. An example of a complete request and the
equivalent abbreviated request follow.

"Print  the  authors  and  locations  of  all  the  articles  cited  by  the 


article, Physical Review, volume 135, page 3
"p aut loc of art cited by 1 135 1."


A third goal of the language is that it be simple enough to process
efficiently and quickly. Even a rather complex request in the language
that was adopted takes much less than a second of central processor
time to interpret.

8.12  Example  of  Language 

In Fig. 8.1 is an example of an interaction that might occur
between a user and the system. The lines that the user types are
underlined. First he initiates the MARS (Machine Aided Retrieval System)
program. We assume that the one fact the user knows is that he is
interested in something about Langmuir probes. He could just as well
have known an author or paper that interested him or perhaps a
combination of these.

In  the  first  command  he  asks  for  a  list  of  those  articles  containing 
the  word,  "Langmuir",  in  their  titles.  Let  us  say  that  after  examination 
of  the  list  produced,  the  user  decides  that  the  papers  by  three  of  the 
authors  are  the  most  interesting.  He  now  asks  for  all  papers  written  by 
these  three  authors  (that  have  not  already  been  retrieved). 

Next  we  assume  that  the  user  selects  two  of  the  papers  as  of 
particular  interest  and  wishes  to  form  a  cluster  around  them.  Further 
he  decides  that  one  of  the  papers  is  definitely  not  what  he  wants  and 
he,  therefore,  specifies  that  it  is  not  of  interest.  A  close  interaction 
sequence  follows  with  the  system  presenting  papers  that  are  about  to  be 
added  to  or  deleted  from  the  set  S  and  the  user  deciding  which  are  of 
interest and which are not.

Finally a cluster is formed and the user stores it on the disc for
future reference. He then analyzes its characteristics by making various
lists of frequency counts.

Fig. 8.1. Example of possible user interaction with data
using  retrieval  language. 

(Lines  typed  by  user  are  underlined.) 

8.2  Description  of  Language 

Two methods of describing the retrieval language have been
selected. In the first the syntax of the language is described by
means of a finite state (sequential) machine. In the second the syntax
and vocabulary are defined by means of Backus normal (ALGOL 60) notation.
The equivalence of these two descriptions is also shown.

8.21  Finite  State  Machine  Description 

There are a number of different methods that could be used to
describe the retrieval language that was developed for this project.
Perhaps the most appropriate way to describe the syntax of the language
would be to present the same table that is actually used by the
interpretive part of the retrieval system. Fig. 8.2 is the syntax table
which has been extracted from a program listing. It is a tabular
description of a finite state machine. The first column contains the
identifications of the various states. Column two pertains to one of
the languages used to write the system (it is the name of a MACRO in FAP)
and is not pertinent to our discussion here. The third column contains
the valid state transitions that can occur. For example, the entry
(V,2) for S1 means that the machine will change from state S1 to S2 if
the input signal is V (verb).


S1     STATE    ((V,2)(X,1)(A,1))
S2     STATE    ((V,2)(C,3)(N,4)(L,8)(E,10)(X,2)(A,2))
S3     STATE    ((V,2)(X,3)(A,3))
S4     STATE    ((N,4)(C,5)(P,6)(X,4)(A,4))
S5     STATE    ((N,4)(X,5)(A,5))
S6     STATE    ((N,7)(X,6)(A,6))
S7     STATE    ((P,6)(L,8)(X,7)(A,7))
S8     STATE    ((L,8)(C,9)(E,10)(X,8)(A,8))
S9     STATE    ((P,6)(L,8)(X,9)(A,9))
S10    STATE    0

Fig. 8.2. Finite state machine description of syntax
of retrieval language.

Fig. 8.3 is the state diagram for the machine of Fig. 8.2. We have
left off the self loops on each state due to the X and A inputs to keep
from cluttering up the diagram. Also not shown is the sink state which
the machine enters when the input sequence being analyzed has an invalid
syntax. For example, if the machine is in state S2 and the input signal
is a P, then the sink state is entered. The initial or starting state
of the machine is S1. The final or accepted state is S10. Thus an
input sequence is considered to have an acceptable syntax if it
transforms the machine of Fig. 8.3 from S1 to S10.

Fig. 8.3. Finite State Diagram for the Table of Fig. 8.2.

(Transitions not shown go to an error or sink
state.)

The input symbols of Fig. 8.2 and 8.3 represent classes of words.
Fig. 8.4 gives the general titles and some examples of the classes. The
interpretive procedure first classifies each word in the input statement
into one of the classes and then checks the syntax by the Table of
Fig. 8.2. In Fig. 8.5 we present a specific example of an acceptable
and an unacceptable statement.
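
The classification step might be sketched as follows. This is an
illustration in Python, not the interpretive procedure itself. The
vocabulary holds only the example words of Fig. 8.4; the plural forms
"articles" and "titles", the handling of the closing period as the
terminator, and the names `VOCABULARY` and `classify` are assumptions.

```python
# Word -> input symbol, taken from the examples of Fig. 8.4.
VOCABULARY = {
    "print": "V", "count": "V",
    "article": "N", "title": "N",
    "articles": "N", "titles": "N",   # plural forms assumed
    "by": "P", "of": "P",
    "first": "A", "last": "A",
    "and": "C", "or": "C",
    "the": "X",
}


def classify(statement):
    """Return the list of input symbols for one statement.

    Words not in the vocabulary are literals (L); the closing period
    is reported as the terminator (E).
    """
    words = statement.strip().rstrip(".").replace(",", " ").split()
    return [VOCABULARY.get(word.lower(), "L") for word in words] + ["E"]


classes = classify("Count the articles by John Jones.")
```

Any word not found in the vocabulary, such as a proper name, falls into
the literal class L.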

Input Symbol    Class Name                   Specific Examples

V               Verbs                        print, count
N               Nouns                        article, title
P               Prepositions                 by, of
A               Adjectives and Adverbs       first, last
C               Conjunction                  and, or
X               Filler Words                 the, a
L               Undefined (literal) words    Jones, laser
E               Terminator                   . (carriage return)

Fig. 8.4. Classes of Input Symbols.

Statement:          Count the articles by John Jones.
Word classes:       V  X  N  P  L  L  E
States traversed:   S1 S2 S2 S4 S6 S8 S8 S10

Statement:          Print the titles of articles and.
Word classes:       V  X  N  P  N  C  E
States traversed:   S1 S2 S2 S4 S6 S7 Sink State

Fig. 8.5. Example of statement with acceptable syntax
and statement with unacceptable syntax.
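
The syntax check can be sketched as a table-driven machine. The Python
below is an illustration, not the FAP table of the system. The
transitions follow Fig. 8.2, with one assumption: the table as printed
lists no L input for S6, yet the accepted statement of Fig. 8.5 passes
from S6 to S8 on a literal, so that transition is added here. The names
`TRANSITIONS`, `SINK`, `trace`, and `accepts` are hypothetical.

```python
# state -> {input symbol: next state}, after Fig. 8.2.
TRANSITIONS = {
    1: {"V": 2, "X": 1, "A": 1},
    2: {"V": 2, "C": 3, "N": 4, "L": 8, "E": 10, "X": 2, "A": 2},
    3: {"V": 2, "X": 3, "A": 3},
    4: {"N": 4, "C": 5, "P": 6, "X": 4, "A": 4},
    5: {"N": 4, "X": 5, "A": 5},
    6: {"N": 7, "L": 8, "X": 6, "A": 6},   # (L,8) assumed from Fig. 8.5
    7: {"P": 6, "L": 8, "X": 7, "A": 7},
    8: {"L": 8, "C": 9, "E": 10, "X": 8, "A": 8},
    9: {"P": 6, "L": 8, "X": 9, "A": 9},
    10: {},
}
SINK = 0          # stands for the error or sink state
START, FINAL = 1, 10


def trace(classes):
    """Return the list of states traversed, stopping at the sink state."""
    state = START
    path = [state]
    for symbol in classes:
        state = TRANSITIONS[state].get(symbol, SINK)
        path.append(state)
        if state == SINK:
            break
    return path


def accepts(classes):
    """True if the sequence of word classes ends in the final state."""
    return trace(classes)[-1] == FINAL


good = trace(["V", "X", "N", "P", "L", "L", "E"])
bad = trace(["V", "X", "N", "P", "N", "C", "E"])
```

With this table the first statement of Fig. 8.5 ends in S10, while the
second enters the sink state on the conjunction that follows the noun.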

Let us comment briefly on the purpose of each state in the diagram
of Fig. 8.3. Preliminary to doing this it should be noted that there
are generally three main parts to an acceptable statement (request):

(1) Verb (states S2 and S3)

(2) Direct object (states S4 and S5)

(3) Modifying phrase (states S6 - S9)

State S1 is the starting state of the machine. State S2 requires that
each request begin with a verb describing what the system should do.
The verb can be either simple (e.g. print) or compound (e.g. count and
save). State S3 excludes the possibility of a double conjunction
between elements of a compound verb (e.g. print and or store). It also
prevents the verb from ending in a conjunction.

State S4 requires that the next part of a request be a list of one
or more nouns signifying the type of information that is to be produced
by the system. This can again be simple (e.g. title) or compound (e.g.
title, authors, and locations). State S5 has a purpose similar to S3.

The last part of the request is the modifying phrase, which
contains the structure of the articles and other entities that are
specified by the user in making the request. States S6 and S7 allow
the request to have a complex structure with several levels of
prepositional phrases modifying other phrases. For example, one could
find the co-authors of a given author by the request: "Find the authors
of articles by John Jones."

States S8 and S9 allow the user to specify some logical combination
of a number of specific fields. For example: "Print the articles by
John Jones and Robert Smith but not Joseph Adams."

The E transition from S2 to S10 is so that certain commands will be
accepted that consist of a verb only. The LE transition between S2 and
S10 allows for an abbreviated mode of reference to certain data (e.g.
Print set 3.). Adjectives and adverbs can occur anywhere in a request
and can modify verbs, nouns, etc.

8.22 Backus Normal Description

Let us leave the finite state description of the syntax of the
language now and provide a more conventional description. The statements
of Fig. 8.6-8 constitute the Backus normal (ALGOL 60) description of
the language. In this notation "::=" means "is defined to be", "|"
means "or", and "< >" encloses the defined elements of the language.

Two  additional  explanations  are  necessary  for  the  Backus  normal 
description  of  Fig.  8.6-8.  All  elements  (words)  in  the  statements  are 
separated  by  one  or  more  word  separators  (blanks,  commas  or  periods) 
except in the definitions for <word> and <integer> where the characters
have  no  separation.  Adjectives,  adverbs,  and  filler  words  can  occur  at 
any  point  in  a  request,  but  this  fact  is  omitted  from  the  description  to 
simplify its statement.

<vocabulary word> ::= <verb> | <conjunction> | <noun> | <preposition> |
                      <adjective> | <adverb> | <filler> | <terminator>

<verb> ::= <find verb> | <print verb> | <delete verb> | <save verb> |
           <read verb> | <other verb>

<find verb> ::= count | find | fetch | f | get | g | keep

<print verb> ::= list | print | p

<delete verb> ::= delete

<save verb> ::= dump | save | store

<read verb> ::= read

<other verb> ::= load | return | search | trace | unload | yes | no | skip

<conjunction> ::= and | and not | but not | not | or

<noun> ::= <article noun> | <title noun> | <word noun> | <author noun> |
           <location noun> | <citation noun>

<article noun> ::= art | article | articles | doc | document | documents |
                   id | ids | identification | identifications | paper |
                   papers

<word noun> ::= keyword | keywords | word | words

<author noun> ::= aut | author | authors

<location noun> ::= loc | location | locations

<citation noun> ::= biblio | bibliography | bibliographies | cit | citation |
                    citations | ref | reference | references

<preposition> ::= <article preposition> | <word preposition> |
                  <author preposition> | <location preposition> |
                  <citing preposition> | <cited by preposition> |
                  <set preposition> | <clustering preposition>

<article preposition> ::= of | used by

<word preposition> ::= contain | contains | containing | use | using

<author preposition> ::= by

<location preposition> ::= at

<citing preposition> ::= cite | citing

<cited by preposition> ::= cited by

<set preposition> ::= in

<clustering preposition> ::= related to | related by authors to |
                             related by citations to

<filler> ::= a | all | all of | an | any | any of | are | been | each | every |
             have | is | the | this | these | those | were | written

<adjective> ::= first | last | most recent

<adverb> ::= by frequency | for decision

<terminator> ::= . <cr>        (<cr> is a carriage return)

Fig. 8.7. Backus normal statements describing vocabulary of language.

<literal> ::= <article literal> | <word literal> | <author literal> |
              <location literal> | <set literal>

<article literal> ::= <journal> <volume> <page>

<word literal> ::= <literal string>

<author literal> ::= <literal string>

<location literal> ::= <literal string>

<set literal> ::= set <integer>

<journal> ::= <journal name> | <alphabetic code> | <numeric code>

<journal name> ::= Phys. Rev. | Physical Review | ... | Physics of Fluids

<alphabetic code> ::= phyrev | phyreb | ... | spjetp

<numeric code> ::= <integer>

<volume> ::= <word> <integer> | <integer>

<page> ::= <word> <integer> | <integer>

<literal string> ::= <word string> | '<word string>'

    (The first word string in this definition cannot include a
    vocabulary word.)

<word string> ::= <word> | <word string> <word>

<word> ::= <character> | <character><character> |
           <character><character><character> | ...

<integer> ::= <digit> | <digit><digit> | <digit><digit><digit> | ...

<character> ::= <letter> | <digit> | <special character>

<letter> ::= a | b | ... | z

<digit> ::= 0 | 1 | ... | 9

<special character> ::= - | / | * | : | ...

<word separator> ::= <blank> | , | .

Fig. 8.8. Backus normal description of literals.

8.23 Equivalence of Descriptions

The equivalence of the Backus normal definition of Sec. 8.22 to
the finite state diagram of Sec. 8.21 can be shown by successively
applying the four transformations of Fig. 8.9 to the statements of
Fig. 8.6. Fig. 8.10 is a brief outline of the steps which would be
taken in this process. One is referred to the literature for an
explanation of the additional concepts (e.g. non-deterministic machines,
equivalent states, etc.) introduced in this figure.


Backus Normal

(1) A ::= B | C
(2) A ::= BC
(3) A ::= AB | C
(4) A ::= BA | C

Finite State

(The four corresponding state diagrams are not reproduced here.)

Fig. 8.9. Rules for transforming Backus normal statements
to finite state diagram.


8.3  Interpretive  Algorithm 

In this section we will describe how the retrieval system interprets
and processes the language of Sec. 8.2. The discussion will
initially cover some general aspects of requests and of the words that
they contain. Sections 8.32-8.34 will describe the various functions
that requests can perform (the verb), the types of data that can be
generated as output (the direct object), and the structure that
specifies the actual request (the modifying phrase).

Fig. 8.10. Outline of steps proving equivalence of Backus normal
and finite state descriptions.

8.31 Vocabulary and Literals

A request consists of one or more lines of characters that the user
types on his time-sharing console. The maximum length of a request is
currently 400 characters. The end of a request is indicated by a period
followed by a carriage return. The request character string is initially
broken up into words. Words are defined to be character strings
separated by blanks, commas, and/or periods. There are two types of
words: those found in the vocabulary table and those not found in the
table. All words not found in the table are called literals. Their
function is to specify the particular authors, title words, citations,
etc. that the user wishes to designate in defining his request. The
vocabulary words are for indicating the function and structure of the
request.

In some cases a user may want to use one of the words in the
vocabulary table as a literal. For example, he may want to find all
titles that contain the vocabulary word, "store". To do this he can
explicitly specify the word as a literal by the use of the literal mark,
" ' ". For the above example the user would say, "print the titles of
all articles containing 'store'."

Note that the retrieval system makes no distinction between lower
and upper case letters. The T.I.P. file does not contain information on
whether a letter is lower or upper case either.
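
The word-splitting rules of this section can be illustrated with a short sketch. It is not the program used in the experimental system; the function name `tokenize`, the pattern `WORD_PATTERN`, and the small `VOCABULARY` set (a hand-picked subset in the spirit of Fig. 8.7) are assumptions made for the example.

```python
import re

# Illustrative subset of the vocabulary table; an assumption for this sketch.
VOCABULARY = {"print", "find", "the", "titles", "of", "all", "articles",
              "containing", "by", "store"}

# A word is either a string enclosed in literal marks, or a run of characters
# containing no blank, comma, period, or literal mark.
WORD_PATTERN = re.compile(r"'([^']+)'|([^\s,.']+)")


def tokenize(request):
    """Split a request into (word, kind) pairs, kind being 'vocabulary' or 'literal'."""
    tokens = []
    for marked, bare in WORD_PATTERN.findall(request):
        if marked:
            # The literal mark forces a literal, even for a vocabulary word.
            tokens.append((marked.lower(), "literal"))
        else:
            word = bare.lower()  # no distinction between upper and lower case
            kind = "vocabulary" if word in VOCABULARY else "literal"
            tokens.append((word, kind))
    return tokens
```

With this sketch, the word "store" in "print the titles of all articles containing 'store'." is classified as a literal, while an unmarked "store" would be looked up in the vocabulary table and treated as a verb.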

8.32  Available  Functions 

The verb part of each request specifies the particular operation or
operations that are to be performed. For example, if the user wants the
results of the search to be printed on his time-sharing console, he
would use the verb, "print". There are currently twenty-three verbs in
the vocabulary and thirteen different functions that they specify. Let
us describe five of the thirteen functions.

(1)  Scratchpad  Storage 

One of the most useful features of the retrieval system is its
scratchpad storage capability. Basically this involves the storage in
core memory of various kinds of data for later reference. For example,
one can create in scratchpad storage a file of all articles written by a
given author by the command, "Find the articles by John Jones." After
creating the set, the system tells the user its size and identification
number (e.g. 6 articles in set 3). Later on the user could find out
what articles cite articles by John Jones by the request, "Print the
articles citing articles in set 3," or just "p art citing set 3."

Each  data  set  in  scratchpad  storage  is  currently  homogeneous  with 
respect  to  the  type  of  information  it  contains.  In  other  words  one 
could  not  create  a  set  that  consisted  of  both  author  and  citation  data. 

Some  of  the  verbs  that  create  sets  in  scratchpad  storage  are: 
count,  find,  fetch,  f,  get,  g,  and  keep.  These  words  are  completely 
equivalent  so  far  as  the  system  is  concerned. 

(2)  Console  Print-out 

The verbs that will cause the data in question to be printed on the
user's console are list, print, and p. A scratchpad set will also be
automatically created (if the output is homogeneous and if it isn't
already a set).

The first line of each print-out consists of the number of items
that will follow. Thus the user is always aware of the ultimate size of
the listing and can interrupt it if he wishes.

(3) Delete Data Sets

Sets or groups of sets can be erased from scratchpad storage by
commands such as "Delete set 4", "Delete all sets."

(4) Save Data Sets

Any scratchpad data set can be placed on the disc for permanent
storage by the verbs save, store, or dump. The form of the command
would be: "Save set 2."

(5) Read Data Sets

Data sets that have been stored on the disc by the above command
can be written back into scratchpad storage by commands of the type:
"Read set 6."

The  functions  of  some  of  the  verbs  can  be  modified  by  adverbs  or 
adverbial  phrases.  Let  us  describe  two  such  modifications  that  have 
been  implemented. 

(1)  Frequency  Lists 

The print verb can be modified to list items in terms of their
frequency of occurrence in the data from which they are extracted. For
example, the command, "Print frequency of title words in Phys. Rev.
Vol. 132." would produce a list of the number of times each word appears
in the titles of articles in Phys. Rev. Vol. 132 (most frequent first
and alphabetical within the same frequency).
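
The ordering of a frequency list (most frequent first, alphabetical within the same frequency) can be sketched as follows. The function name `frequency_list` and the sample titles in the usage note are invented for illustration and are not taken from the T.I.P. file.

```python
from collections import Counter


def frequency_list(titles):
    """Return (word, count) pairs, most frequent first, alphabetical within ties."""
    counts = Counter(
        word
        for title in titles
        for word in title.lower().split()
    )
    # Sort on descending count first, then on the word itself.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

For the two hypothetical titles "Thin film growth" and "Film stress in thin film", the list would begin with "film" (3 occurrences), then "thin" (2), followed by "growth", "in", and "stress" in alphabetical order.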

(2)  Decision  Print-outs 

The print verb can also be modified so that there is a pause after
each item is printed out to allow the user to decide upon and respond to
the item. This would be the command used, for example, by a user who
wished to be coupled into the clustering procedure. For the command,
"Print for decision the titles of articles related to Nuovo Cimento
Vol. 30, page 1.", the procedure would pause after printing the title of
each article about to be added to or deleted from the set S and allow
the user to place the article in the Y or Z set if he wished.

8.33  Data  Generated 

The second part of the request is the direct object of the verb.
It is a list of the types of information (nouns) that the user specifies
he wants in the system's response to the request. Fig. 8.7 indicates
six different types of nouns that can be used for this purpose (article,
title, word, author, location, and citation nouns). The correspondence
of these words to the various types of data found in the T.I.P. file is
fairly obvious. Any combination of these types of data can be printed
on the user's console, but only one type can be put in scratchpad
storage for a given request. The form of the data as it is printed on
the console is shown in Fig. 6.4. The data placed in scratchpad has the
single level structure indicated by Fig. 8.11 (see Sec. 7.1).

(Diagram labels: Set Node; Author Name Nodes.)

Fig. 8.11. File structure of data in scratchpad storage.

8.34 Request Structure

The third and final component of the request is the phrase which
modifies the direct object of the verb. It consists of a series of
prepositional phrases which either modify the direct object itself or
else modify the noun object of one of the other prepositional phrases.
Let us define the structure of this modifying phrase and describe how it
is interpreted.

8.341 Determination of Literal Type

The object of each preposition can be a noun or a literal. In the
case of a literal some indication must be given of its type, since there
is no intrinsic difference between most of the types (e.g. a word
literal might look exactly like an author literal). The first preposition
to the left of a literal is currently used to determine the type.
Fig. 8.12 lists the literal type which is assumed to follow each
preposition. For example, any word not in the vocabulary that follows the
preposition, "by", is assumed to be an author's name.

The  one  exception  to  this  is  the  set  literal  which  can  be  the 
object  of  any  preposition.  It  is  distinguished  from  other  literals,  not 
by  the  preceding  preposition,  but  by  the  word,  "set",  at  the  beginning 
of  the  literal. 

There is one additional way of indicating the literal type which has
been partially implemented but is not described in Sec. 8.2. This
involves the use of a noun between the preposition and the literal. An
example of this would be the phrase, "with the word, phonon", which is
acceptable and identical to the phrase, "using phonon". A change such as
this would become essential if the number of data types increased
substantially, since there would not be enough suitable prepositions.

Preposition Type              Type of Object

<article preposition>         <article noun>, <citation noun>, <article literal>
<word preposition>            <word noun>, <word literal>
<author preposition>          <author noun>, <author literal>
<location preposition>        <location noun>, <location literal>
<citing preposition>          <article noun>, <citation noun>, <article literal>
<cited by preposition>        <article noun>, <citation noun>, <article literal>
<set preposition>             <set literal>
<clustering preposition>      <article noun>, <citation noun>, <article literal>

Fig. 8.12. Valid types of objects for each preposition class.
(Set literals are valid objects for any preposition
and are not listed.)
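
The rule of this section, that the first preposition to the left of a literal fixes its type unless the literal begins with the word "set", might be sketched as below. The table `PREPOSITION_LITERAL_TYPE` is transcribed from Figs. 8.7 and 8.12; the function name `literal_type` and the representation of a request as a list of lower-case words are assumptions of the sketch, not a description of the actual interpreter.

```python
# Literal type assumed to follow each preposition (after Figs. 8.7 and 8.12).
# Multi-word prepositions are stored as tuples of words.
PREPOSITION_LITERAL_TYPE = {
    ("of",): "article", ("used", "by"): "article",
    ("contain",): "word", ("contains",): "word", ("containing",): "word",
    ("use",): "word", ("using",): "word",
    ("by",): "author",
    ("at",): "location",
    ("cite",): "article", ("citing",): "article",
    ("cited", "by"): "article",
    ("related", "to"): "article",
    ("related", "by", "authors", "to"): "article",
    ("related", "by", "citations", "to"): "article",
}


def literal_type(preceding, literal):
    """Return the type of `literal` given the words `preceding` it in the request."""
    # A set literal is recognized by its own first word, not by the preposition.
    if literal and literal[0] == "set":
        return "set"
    # Scan leftward; at each position try the longest preposition first,
    # so that "cited by" is not mistaken for the author preposition "by".
    for end in range(len(preceding), 0, -1):
        for length in (4, 3, 2, 1):
            key = tuple(preceding[max(0, end - length):end])
            if len(key) == length and key in PREPOSITION_LITERAL_TYPE:
                return PREPOSITION_LITERAL_TYPE[key]
    return None  # no preposition found to the left
```

Trying the longest preposition first is a design choice of the sketch; the text does not say how the system distinguishes "cited by" from "by".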


8.342 Form of Literals

After the general type of information that a literal contains is
determined, one must next interpret what specifically is meant by each
literal. To this end let us describe the conventions which govern the
form that each type of literal can take.

Article literals generally consist of three parts: the journal,
volume, and page. The journal can be specified by using the full title,
the standard abbreviation of the title, or a special alphabetic or
numeric code. The volume and page number can each consist of an integer
or a word followed by an integer. Some examples of acceptable article
literals are:

    Physical Review, volume 128, page 1
    Phys. Rev., vol. 128, p. 1
    Phyrev v 128 p 1
    1 128 1

The volume and page number have been made optional so that one can
refer to all articles in a given journal or in a given volume by a
single literal.

Each word literal should consist of a single word. If one wishes
to search for a phrase of two or more words, he should use two or more
literals (e.g. "print titles of articles using thin and film.").

A word literal represents (matches) not only the word in the file
which is identical to it, but also all words to which it is the prefix.
Thus the command, "Get the art using supercon." would get all articles
with titles containing superconductor, superconductivity, etc.

If one does not want prefix matching, he can use a "*" to designate
an explicit blank. The command, "p art using laser*.", would not
produce those articles whose titles contain the word, "lasers".

Author literals are to be written with the surname last (e.g.
John H. Jones). A literal that consists of a surname only will retrieve
all authors with that surname. A literal containing one or more given
names will match those author names in the file for which the surname
matches exactly and for which every given name in the literal is the
prefix of the corresponding given name in the file. Thus, "p art by Al
Jones.", would print all articles by "Albert Jones", "Alden Jones",
and "Allen S. Jones".

Location literals must be given in a request exactly as they are
found in the data file if retrieval is to be accomplished.

Set literals consist of the word, "set", followed by the identification
number of the desired set.
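
The matching conventions for word and author literals can be expressed compactly. The following is a minimal sketch under two assumptions that the text does not confirm: that names in the file are stored with the surname last, and that periods after initials can be ignored. The function names `word_matches` and `author_matches` are invented for the example.

```python
def word_matches(literal, file_word):
    """Prefix match, unless the literal ends in '*' (explicit blank: exact match)."""
    literal, file_word = literal.lower(), file_word.lower()
    if literal.endswith("*"):
        return file_word == literal[:-1]
    return file_word.startswith(literal)


def author_matches(literal, file_name):
    """Surname must match exactly; each given name in the literal must be a
    prefix of the corresponding given name in the file."""
    *given_lit, surname_lit = literal.lower().replace(".", " ").split()
    *given_file, surname_file = file_name.lower().replace(".", " ").split()
    if surname_lit != surname_file:
        return False
    if len(given_lit) > len(given_file):
        return False  # the literal names more given names than the file holds
    return all(f.startswith(g) for g, f in zip(given_lit, given_file))
```

Under these assumptions the literal "Al Jones" matches "Albert Jones", "Alden Jones", and "Allen S. Jones", and the literal "laser*" matches "laser" but not "lasers".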


8.343 Action Initiated by Each Preposition

Each prepositional phrase in a request initiates a file search
(table look-up) in an appropriate data file. If the object of the
preposition is an author, location, word, or citation literal, then the
file used is the corresponding inverted file. If the object of the
phrase is an article literal then the raw data file is used.

The information obtained from an inverted file is, of course,
always a list of article identifications. The type of information
obtained from the raw data file is determined by the type of noun that
is modified by the prepositional phrase in question. For example, in
the command, "Print authors of Phys. Rev. 128 1.", the table look-up
for the "of" preposition would be in the raw data file and would select
the author information.

The set of articles (or other data) produced by each table look-up
can in turn be the object of another preposition and another table
look-up. Consider the request, "Print the titles of articles cited by
articles  by  John  Jones."  The  procedure  first  looks  up  the  articles  by 
John  Jones.  Then  it  finds  the  articles  cited  by  the  articles  by  John 
Jones.  And  finally  it  retrieves  and  prints  the  titles  of  the  articles 
so  obtained.  Note  that  each  of  the  three  prepositions,  of,  (cited)  by, 
and  by  initiated  a  particular  type  of  file  search. 

There are two types of prepositions that do not cause a table
look-up in a file. A clustering preposition performs more than just a table
look-up. The procedure of Chapter V is executed, resulting in the set
of articles of the appropriate cluster.

The set preposition does not initiate a file search but produces
the input set as its output (a unitary transformation). Thus in the
request, "Print the title of articles in set 4", the preposition, "in",
merely passes on the articles in set 4 to the next preposition, "of",
which looks up their titles.
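
The way the output of one table look-up becomes the input of the next can be sketched with a miniature data file. Everything in the sketch is hypothetical: the article identifiers, titles, the dictionary layout of `RAW_FILE`, and the function names are assumptions chosen for illustration, and the sketch does not reproduce the file structure of the experimental system.

```python
# Hypothetical raw data file: article identification -> record.
RAW_FILE = {
    "A1": {"title": "Spin waves", "authors": ["John Jones"], "cites": ["A3"]},
    "A2": {"title": "Magnon decay", "authors": ["John Jones"], "cites": ["A3", "A4"]},
    "A3": {"title": "Phonon scattering", "authors": ["Robert Smith"], "cites": []},
    "A4": {"title": "Thin films", "authors": ["Charles White"], "cites": []},
}


def build_author_file(raw_file):
    """Build an inverted file: author name -> set of article identifications."""
    inverted = {}
    for article, record in raw_file.items():
        for author in record["authors"]:
            inverted.setdefault(author.lower(), set()).add(article)
    return inverted


AUTHOR_FILE = build_author_file(RAW_FILE)


def by(author):
    """Author preposition: look-up in the inverted author file."""
    return set(AUTHOR_FILE.get(author.lower(), set()))


def cited_by(articles):
    """Cited-by preposition: collect the citations of each input article."""
    return {cited for article in articles for cited in RAW_FILE[article]["cites"]}


def in_set(articles):
    """Set preposition: passes its input on unchanged."""
    return set(articles)


def titles_of(articles):
    """Article preposition modifying a title noun: fetch titles from the raw file."""
    return sorted(RAW_FILE[article]["title"] for article in articles)
```

The request "Print the titles of articles cited by articles by John Jones." then corresponds to the nested call `titles_of(cited_by(by("John Jones")))`, evaluated from the innermost phrase outward.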

8.344 Logical Operations

The results of the table look-ups (or clustering) for two or more
prepositional phrases can be combined by the standard logical operations
(and, or, not). Consider, for example, the request, "Print the articles
by John Jones and by Robert Smith or by Charles White but not by David
Allen." The logical operation performed can be represented by the
equation $[((J.J. \cap R.S.) \cup C.W.) \cap \overline{D.A.}]$ where the
initials J.J. stand for the set of papers by John Jones and
$\overline{D.A.}$ is the set of papers not written by David Allen. It
will be noted that the logical operations
are performed from left to right through the request in the same
sequence in which the user typed them in. It was thought that this
might be a more useful convention for a system that is closely coupled
to the user than to have a parenthesized system with a hierarchy of the
types of operations to perform first (as in MAD, FORTRAN, etc.).

Any arbitrarily complex logical structure can be obtained by this
kind of approach (without having to use parentheses) if one creates sets
in scratchpad storage. For example the set of articles represented by
the logical expression, $(J.J. \cap R.S.) \cup (C.W. \cap \overline{D.A.})$,
could be created by the sequence of commands:

    Find art by John Jones and by Robert Smith.
    3 articles in set 1.
    Find art by Charles White but not by David Allen.
    1 article in set 2.
    Print art in set 1 or in set 2.


There is one logical structure that is not allowed in the system
since it makes little sense in retrieval applications. This is the
negation of any of the operands of the "or" operation. Consider the
command, "Print articles by John Jones or not by Robert Smith." If
this means $(J.J. \cup \overline{R.S.})$, then the articles requested
would include most of the file since Robert Smith would have authored at
most 20-30 articles.

The  conjunctive  operation  between  each  pair  of  prepositional 
phrases  must  be  explicitly  stated.  One  could  not  say,  "Print  art  by 
John  Jones,  by  Robert  Smith,  and  by  Charles  White."  However,  one  can 
omit  the  prepositions  after  the  first  one  (e.g.  "Print  art  by  John  Jones 
and  Robert  Smith."). 
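
The left-to-right convention can be made concrete with sets. The sketch below is illustrative only: the function name `combine`, the integer article identifications, and the treatment of "and not", "but not", and "not" as the same set difference are assumptions made for the example.

```python
def combine(first, steps):
    """Apply (conjunction, operand) pairs strictly from left to right."""
    result = set(first)
    for conjunction, operand in steps:
        if conjunction == "and":
            result &= operand
        elif conjunction == "or":
            result |= operand
        elif conjunction in ("and not", "but not", "not"):
            result -= operand  # intersection with the complement
        else:
            raise ValueError("unknown conjunction: " + conjunction)
    return result


# Hypothetical sets of article identifications.
JJ, RS, CW, DA = {1, 2, 3}, {2, 3, 4}, {5}, {3, 5}

# "... by John Jones and by Robert Smith or by Charles White but not by David Allen."
single_request = combine(JJ, [("and", RS), ("or", CW), ("but not", DA)])

# The same four authors combined through two scratchpad sets.
set_1 = combine(JJ, [("and", RS)])
set_2 = combine(CW, [("but not", DA)])
two_step = combine(set_1, [("or", set_2)])
```

With these invented sets the single request yields {2}, while the two-step sequence yields {2, 3}, which shows why scratchpad sets are needed to express groupings that the left-to-right rule cannot.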

8.345 Selection of Predecessor

The next problem to be considered is the determination of what
noun(s) each prepositional phrase modifies (its predecessor). Consider
the request, "Find the articles citing articles by John Jones and cited
by Physics of Fluids, v. 7, p. 1." The last phrase, "cited by..." can
conceivably modify either of the two preceding "articles" words.
However, the answer to the request is markedly different depending on
the interpretation selected. The approach adopted here is to "attach"
each prepositional phrase to the first noun to the left of the phrase
that is a valid type for the preposition in question. In Fig. 8.13 the
valid noun types that can be modified by each preposition are listed.

Note that each preposition that immediately follows a noun and not
a conjunction, must modify that noun and cannot be attached to other
nouns further to the left. If the noun is not valid for the preposition
by Fig. 8.13, then the request is considered in error. The request,
"Find the articles by John Jones and the citations at Harvard University.",
would not be valid because the preposition, "at", is not a valid modifier
of "citations" and cannot be attached to the earlier "articles" word
because it does not immediately follow a conjunction.


Preposition Type              Modifiable Noun Types

<article preposition>         <noun>
<word preposition>            <article noun>, <citation noun>
<author preposition>          <article noun>, <citation noun>
<location preposition>        <article noun>, <citation noun>
<citing preposition>          <article noun>, <citation noun>
<cited by preposition>        <article noun>, <citation noun>
<set preposition>             <noun>
<clustering preposition>      <article noun>, <citation noun>

Fig. 8.13. Types of nouns that each class of prepositions
can modify.
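
The attachment rule of this section can be sketched as a small function. The table `MODIFIABLE` is transcribed from Fig. 8.13 as it appears above; the function name `select_predecessor`, the representation of the request as a list of noun types in the order typed, and the use of an exception for an invalid request are assumptions of the sketch.

```python
ANY_NOUN = {"article", "title", "word", "author", "location", "citation"}
ARTICLE_OR_CITATION = {"article", "citation"}

# Noun types that each preposition class can modify (after Fig. 8.13).
MODIFIABLE = {
    "article": ANY_NOUN,
    "word": ARTICLE_OR_CITATION,
    "author": ARTICLE_OR_CITATION,
    "location": ARTICLE_OR_CITATION,
    "citing": ARTICLE_OR_CITATION,
    "cited by": ARTICLE_OR_CITATION,
    "set": ANY_NOUN,
    "clustering": ARTICLE_OR_CITATION,
}


def select_predecessor(nouns_to_left, preposition_class, follows_conjunction):
    """Return the index of the noun that the prepositional phrase modifies."""
    valid = MODIFIABLE[preposition_class]
    if not nouns_to_left:
        raise ValueError("no noun precedes the preposition")
    if not follows_conjunction:
        # Directly after a noun: that noun must be the predecessor.
        last = len(nouns_to_left) - 1
        if nouns_to_left[last] in valid:
            return last
        raise ValueError("preposition cannot modify the preceding noun")
    # After a conjunction: take the first valid noun found scanning leftward.
    for index in range(len(nouns_to_left) - 1, -1, -1):
        if nouns_to_left[index] in valid:
            return index
    raise ValueError("no valid noun to the left of the preposition")
```

For the request "Find the articles citing articles by John Jones and cited by ...", the noun list is two article nouns and the "cited by" phrase follows a conjunction, so the sketch attaches it to the second "articles" word (index 1).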


8.346 Interpretation of Adjectives

Let us make two final comments concerning the interpretation of the
language. Filler words are adjectives, adverbs and certain other words
that initiate no action in the interpreter. They are effectively ignored.
Their only use is to make the statement of the request more smooth and
natural.

There are other adjectives and adverbs that do affect the interpreter,
however. Some of them are listed in Fig. 8.7. A large number of
adjectives and adverbs come to mind that would be very useful if
implemented. However only enough of them were made part of the experimental
system so the possibility of their use in the language could be tested.


PART FOUR: RESULTS AND CONCLUSIONS


Part Two introduced a theoretical model for a
document retrieval system. The experimental system
developed to test the model in a realistic environment
was described in Part Three. In this part we
present the experimental results obtained with the
system and the conclusions about the model that can
be drawn from them.

This  final  part  is  divided  into  two  chapters. 

Chapter  IX:  Experimental  Results 


Chapter X: Conclusions

CHAPTER IX
EXPERIMENTAL RESULTS

In the first section of this chapter some data on the general
characteristics of clusters will be presented. Then some specific
examples will be given illustrating the composition of clusters in
terms of the frequency of occurrence of title words, authors, and
citations of the included articles.

In the next two sections clusters will be compared with some
existing sets of documents which have already been judged to be
mutually pertinent. Three bibliographies found in review articles that
are not part of the T.I.P. file and two subject categories compiled by
indexers will be used for this purpose.

Finally,  the  results  of  two  tests  will  be  presented  in  which 
clusters  were  evaluated  by  representative  users  of  the  document  file. 

9.1  Cluster  Parameters 

Before attacking the problem of whether or not clusters contain
sets of documents that are mutually interesting to users, it may be
appropriate to first summarize some of the more general features of
clusters. This section will, accordingly, present statistics on certain
cluster parameters.

The data from which the statistics are drawn come from the tests of
Secs. 9.3 to 9.5. They are, of course, a function of the particular
requests presented to the system during the tests and of the composition
of the T.I.P. file at the time. It was thought, however, that this
would serve as an introduction to the experimental results.

The first parameter that will be described is cluster size. Fig.
9.1 shows the distribution by size of some different clusters generated
by the procedure. The largest cluster found so far contains 159 documents,
while the smallest contains only one document.

(Plot; vertical axis: Number of Clusters.)

Fig. 9.1. Distribution of cluster size for 490 clusters.

One of the important features of the clustering procedure as
described in Chapter V is its ability to adjust the size of the answer
to fit the request. This is accomplished by applying a bias to the
links of the document network (see Sec. 4.4). About 82% of the clusters
examined utilized either a positive or negative bias, with the other 18%
having no (zero) bias.

In Fig. 9.2 the distribution of clusters for various ranges of bias
is shown. Fig. 9.3 indicates that the average cluster size increases
monotonically as the bias increases. This curve seems to follow the
equation y^-80(x-12) where y is the cluster size and x is the bias. We
will not attempt to explain why this is the case here.

(Fig. 9.2: plot; vertical axis: Number of Clusters.)

(Fig. 9.3: plot; vertical axis: Average Cluster Size.)

Another characteristic of the procedure that can be studied is the
way documents are deleted from the set (S) that is being formed. The
formation of 37 clusters was observed. It was found that an average of
three documents were deleted per cluster. This resulted in an average
deletion of one document in every 1$ iterations. It was also found that
about 90% of the documents that were deleted from S were added to S
at some later time during the clustering.

Let us next ask when during the clustering process deletions occur.
Fig. 9.4 indicates that deletions are more likely to occur toward the
end of the clustering process.

(Plot; vertical axis: percent of deleted documents in each quartile.)

Fig. 9.4. Percent of deletions occurring in each quartile of
the clustering process.

(average  for  7?  clusters) 

In the final portion of this section we will describe the way the
procedure responds to requests that are inconsistent or ambiguous. A
specific example (Cluster A of Sec. 9.33) is used for this purpose.
The first test consisted of holding the pertinent (Y) set of the request
constant and successively placing every other member of Cluster A
in the non-pertinent (Z) set (Y = a^; Z = a_i, i = 1, ..., n). The
results are shown in Figs. 9.5 and 9.6.

There are three basic types of responses that resulted. In seven
tests the size of the cluster was reduced. This was, in general, what
happened when the document specified as not pertinent had a smaller bias
to A than a^ did. In eight other cases the procedure was found to
select another cluster (B, D, or E) containing some documents that were
not part of the original cluster. In the remaining twelve cases the
request was judged to be inconsistent. A careful examination of the
network revealed that in each of the twelve cases there was at least
one cluster which could have satisfied the request. The reasons why
the procedure was not able to locate a valid answer cluster in these
cases have already been discussed in Sec. 5.51.

Figs. 9.5 and 9.6 illustrate two types of request ambiguity. The
first type is hierarchical in nature, involving clusters that are subsets
of larger clusters. Take, for example, the request, Y*a; Z'a^y. It
can be satisfied not only by the cluster listed for it in Fig. 9.5, but
also by the smaller clusters listed for a^, and a^. The second
type of ambiguity is due to the fact that clusters overlap. Thus the
clusters B, D, or E also satisfy the request Y^a^jZ^y.

A second test was conducted in order to further study the extent of
the second type of ambiguity. In this test a given document was
specified as pertinent and a cluster was found. The document which had
the highest correlation to the cluster found was then specified as
non-pertinent and another search was conducted. If a second cluster was
found then the document with the highest correlation to the new cluster
was added to Z, and the process was continued. At some point the request
became inconsistent.
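
The loop used in this second test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `find_cluster` and `most_correlated` are toy stand-ins (hypothetical names), not the clustering procedure or the correlation measure of this thesis, and the candidate clusters and document names are invented.

```python
from typing import FrozenSet, List, Optional, Set

# Toy candidate clusters (hypothetical data, not taken from the T.I.P. file).
CANDIDATES: List[FrozenSet[str]] = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "d"}),
]


def find_cluster(y: Set[str], z: Set[str]) -> Optional[FrozenSet[str]]:
    """Stand-in search: first cluster holding all of Y and none of Z."""
    for cluster in CANDIDATES:
        if y <= cluster and not (z & cluster):
            return cluster
    return None  # no cluster satisfies the request: it is inconsistent


def most_correlated(cluster: FrozenSet[str], used: Set[str]) -> Optional[str]:
    """Stand-in for 'highest correlation': alphabetically first unused member."""
    remaining = sorted(cluster - used)
    return remaining[0] if remaining else None


def ambiguity_test(seed: str) -> List[int]:
    """Return the sizes of the successive answer clusters for one seed document."""
    y: Set[str] = {seed}   # pertinent set, held constant
    z: Set[str] = set()    # non-pertinent set, grows by one document per round
    sizes: List[int] = []
    while True:
        cluster = find_cluster(y, z)
        if cluster is None:
            return sizes
        sizes.append(len(cluster))
        doc = most_correlated(cluster, y | z)
        if doc is None:
            return sizes
        z.add(doc)


print(ambiguity_test("a"))
```

Each run ends when no cluster satisfies the request, which is the point at which the request is reported as inconsistent.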

The results of this type of test on six articles are given in
Fig. 9.7. Note that the pertinent document of Fig. 9.5 would result in
the test pattern of Example 4, since the answer to the request becomes
inconsistent once the document most highly correlated to A is specified
as non-pertinent.


Articles  In  Bias  of 
Cluster  (a)  tj  to  A 

Rank  by  Mas 
(largest  first) 

Answer  to  the  Request: 
Y-a,  J  Z-a< 

al 

lUi.9  bits 

20 

Inconsistent 

“2 

132.7 

5 

B 

*3 

121.0 

15 

Inconsistent 

•u 

130.3 

8 

Inconsistent 

*5 

103.2 

26 

aH-5 

*6 

118  .k 

16 

B 

*7 

116.3 

17 

An(a5a6«7ai0a12aual6al3 

a8 

131.9 

6 

Inconsistent 

a9 

123.2 

13 

Inconsistent 

°10 

109.8 

23 

AH(i*5«io8  >a15“jL6*l8) 

all 

127 .!» 

9 

Inconsistent 

*12 

lOli. 6 

25 

a13 

136.6 

1 

inconsistent 

“Hi 

126.1 

11 

Inconsistent 

al5 

110. 1j 

22 

0 

al6 

102.8 

27 

aH(^) 

“l7 

122.0 

u 

B 

al8 

1C6.6 

21 

An(a5*12*l6*l8) 

“19 

116.2 

18 

E 

*20 

112.3 

21 

A^'*5*10*12*l5*l6*l8*20^ 

a  . 

t  i. 

lli6.1i 

2 

E 

*22 

12 

Inconsistent 

*23 

155.6 

1 

Inconsistent 

*2u 

111. 8 

) 

Inconsistent 

"25 

115.1 

19 

E 

*26 

130. U 

i 

Inconsistent 

*27 

127.0 

10 

E 

B'(*l*J*i6*l3*2o) 

plus  12 

other 

articles 

>(V2\Vl7*20 

)  plus  11 

.  other 

articles 

B *(*1*2*i0)  ?iu* 

20  other 

articles 

Fig. 9.5. Example of clusters which result when documents
are specified as non-pertinent.

Fig. 9.6. Diagram of relationship of clusters of Fig. 9.5.
(Each circle represents a cluster)


Example    Size of successive answer clusters

   1       31, 22, 27, inconsistent
   2       17, 125, 1., 2, inconsistent
   3       22, 36, 23, 23, inconsistent
   4       27, inconsistent
   5       33, 27, inconsistent
   6       39, 33, lit, inconsistent

Fig. 9.7. Test of request ambiguity.
9.2 Cluster Composition

In the last section statistics on some of the more general features
of clusters such as size and bias were presented. In this section the
composition of clusters will be described in terms of data available
in the T.I.P. file. In particular, examples will be given of the
composition of clusters in terms of the title words, authors, and
citations of the included articles.

In Fig. 9.8 we list in order of frequency of occurrence the title
words for six clusters. Note that the common "function" words (in, of,
the, and, on, etc.) have been omitted from all of the lists except for
Example A. Also the lists have been truncated to include only the words
that occurred most often in the titles. The full titles of Example B
are shown in Fig. 9.16.

In none of the cases studied did the title of every article in a
cluster contain the same word. For Fig. 9.8 the word that comes closest
to occurring in every title is "plasma" of Example D, which occurs in
18/22 = 82% of the titles. If one were to group together words of
equivalent meaning, then "superconducting" and "superconductors" in
Example A would be highest with 27/31 = 87%.
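
A minimal Python sketch of how such title-word counts and percentages could be tallied is given here. It is an illustration only: the function names are hypothetical, the stop list contains just the function words named in the text, and the three sample titles are taken from the listing of Fig. 9.16 rather than from a full cluster.

```python
from collections import Counter
from typing import Iterable

# The common "function" words named in the text.
FUNCTION_WORDS = {"in", "of", "the", "and", "on"}


def title_word_counts(titles: Iterable[str]) -> Counter:
    """Count how often each non-function word occurs across a set of titles."""
    counts: Counter = Counter()
    for title in titles:
        for word in title.lower().split():
            if word not in FUNCTION_WORDS:
                counts[word] += 1
    return counts


def percent(part: int, whole: int) -> int:
    """Share of `whole` represented by `part`, rounded to a whole percent."""
    return round(100 * part / whole)


titles = [
    "Generation of spin waves in nonuniform magnetic fields",
    "Magneto-elastic waves in yttrium iron garnet",
    "Anisotropic spin-wave propagation in ferrites",
]
counts = title_word_counts(titles)
print(counts.most_common(3))
print(percent(18, 22), percent(27, 31))
```

Note that this simple tokenizer treats "spin-wave" and "waves" as different words; grouping words of equivalent meaning, as discussed above, would need an explicit mapping.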

In Fig. 9.9 some similar examples are given for the authors of the
articles in clusters. In Example A it was found that E. Schlomann is
the author of two other papers in the T.I.P. file (in addition to the
four listed), R. I. Joseph of one other, and W. Strauss of two others.

In Fig. 9.10 citation counts are given for the same three clusters
that were used in Fig. 9.9. In Example A there is one citation which
is found in all of the articles in the cluster. In Example B, 46/64 = 72%
of the articles cite the same paper, while only 10/35 = 29% do in Example
Example A                Example B                Example C
Cluster of               Cluster A1 of            Cluster of
Sec. 9.33                Sec. 9.31                Sec. 9.33
31 articles              12 articles              22 articles
99 words                 66 words                 75 words

22 in                     7 waves                 12 quantum
22 superconducting        5 spin                  11 oscillations
19 of                     3 garnet                 8 ultrasonic
13 ultrasonic             3 iron                   6 attenuation
10 energy                 3 magnetic               6 field
10 gap                    3 magneto-elastic        6 giant
 9 the                    3 microwave              6 metals
 8 attenuation            3 nonuniform             5 effect
 5 and                    3 propagation            4 magnetic
 5 superconductors        3 yttrium                4 magnetoacoustic
 5 tin                    2 crystal                3 absorption
 4 by                     ...                      3 sound
 4 determination                                   2 alphen
 4 waves                                           ...
 3 (11 words)
 2 (16 words)
 1 (58 words)

Example D                Example E                Example F
Cluster of               Cluster of               Cluster for article
Sec. 9.51                Sec. 9.52                8 of Fig. 9.11
22 articles              40 articles              2? articles
84 words                 l5t words                81 words

18 plasma                20 plasma                16 optical
 9 turbulent             17 probe                  7 generation
 8 waves                 11 langmuir               7 harmonic
 5 particles              9 probes                 6 nonlinear
 4 electromagnetic        5 characteristics        5 theory
 4 turbulence             5 field                  3 second
 3 charged                5 magnetic               ...
 ...                      4 electrostatic
                          4 resonance
                          4 studies
                          3 double

Fig. 9.8. Title-word frequency counts for six clusters.

(The number to the left of each word is the number
of times it occurs in the titles of the cluster.)
Example A                Example B                Example C
Cluster of               Cluster of               Cluster of
Sec. 9.31                Sec. 9.32                Sec. 9.52
12 articles              64 articles              35 articles
13 authors               75 authors               36 authors

4 Schlomann Ernst        7 Spector Harold N.      7 Kraichnan Robert H.
3 Joseph R. I.           4 Prohofsky E. W.        2 Deissler Robert G.
2 Damon R. W.            3 Gurevich V. L.         2 Eschenroeder Allan Q.
2 Strauss W.             3 Kroger Harry           1 (35 authors)
2 Van De Vaart H.        3 Pustovoit V. I.
1 (8 authors)            2 (8 authors)
                         1 (62 authors)


Fig. 9.9. Author frequency counts for three clusters.


Example  A 

Example  B 

Example  C 

Cluster  A.  of 

Cluster  A,  of 

Cluster  A,-  of 

Sec.  9.317 

Sec.  9*32. 

Sec.  9.527 

12 

articles 

61*  articles 

35  articles 

35 

citations 

369  citations 

195  citations 

12 

11-31-1298 

1*6  1*1-7-23? 

10  802-5-a97 

7 

la-8 -357 

31  11-33-21*57 

6  227-2-12a 

6 

11-35-159 

29  1*1-9-87 

5  8-30-301 

k 

11-35-167 

22  11-33-1*0 

5  799-7-1030 

3 

1-105-390 

19  11-3^-151*8 

5  802 -12 -2a2 

3 

1-120-2001* 

19  ai-9-296 

5  802-13-369 

3 

11-35-1022 

18  1-127 -1081 

5  802-16-33 

2 

1-125-1950 

lu  l-126-19?a 

a  (3  citations) 

2 

11-31-161*7 

ia  ai-8-a 

3  (13  citations) 

2 

11-35-2382 

10  !*i-a-5o5 

2  (33  citations) 

2 

11-35-2382 

9  i -13a -1302 

1  (139  citations) 

2 

11-36-875 

9  23-8-161 

2 

1*1-6-620 

7  (a  citations) 

2 

1*1-12-583 

6  (7  citations) 

2 

708-19-308 

5  (l2  citations) 

1 

(21  citations) 

a  (12  citations) 

3  (l8  citations) 

2  (a9  citations) 

1  (262  citations) 

Fig.  9.10.  Citation  frequency  counts  for  three  clusters. 
C. Example C is an illustration of an area where all of the articles
do not cite one central paper and yet through the use of a large
positive bias they can be pulled together into a cluster.

The papers listed in Fig. 9.10 are identified by three numbers:
the journal code (see Fig. 6.3), volume, and page number. Thus
1-136-libl is the paper beginning on page liljl in volume 136 of the
Physical Review.
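
The three-number identifier can be split mechanically, as the following Python sketch suggests. The class and function names are hypothetical, and only the journal code for the Physical Review is filled in because it is the only one stated in the text; the full code table belongs to Fig. 6.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    journal_code: int
    volume: int
    page: int


def parse_citation(identifier: str) -> Citation:
    """Split an identifier of the form 'journal-volume-page' into its parts."""
    journal, volume, page = (int(part) for part in identifier.split("-"))
    return Citation(journal, volume, page)


# Only this code is given in the text; other codes are listed in Fig. 6.3.
JOURNAL_NAMES = {1: "Physical Review"}

c = parse_citation("1-127-1081")
print(JOURNAL_NAMES.get(c.journal_code, "unknown journal"), c.volume, c.page)
```

An identifier that does not consist of exactly three integer fields raises a `ValueError`, which is a convenient way to flag keypunching errors of the kind discussed later in this chapter.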

9.3  Comparison  to  Bibliographies 

The  next  test  will  be  to  compare  the  bibliographies  found  in  certain 
papers  with  clusters  formed  by  the  procedure.  Consider,  for  example,  a 
paper  with  20  citations.  It  would  be  of  interest  to  know  if  a  cluster 
can  be  formed  which  includes  most,  if  not  all,  of  the  20  citations. 

For this purpose three articles were selected from the special
October 1965 issue of the IEEE Proceedings on ultrasonics. It was
decided that these articles, which are not part of the T.I.P. file, would
ensure some degree of independence between the data base and evaluation
criteria. The IEEE Proceedings represented a journal which is closely
related to the T.I.P. physics file and yet is not actually part of the
file. Since the T.I.P. file covers only the last three years, a recent
issue of the IEEE Proceedings was needed if a suitable fraction of the
bibliographies of the evaluating papers were to be found in the T.I.P.
file.

Of the twenty-seven articles in the October IEEE Proceedings, only
ten cite ten or more articles in the T.I.P. file. Fig. 9.11 tabulates
these ten papers. For the three articles to be used in evaluating the
clustering procedure we selected the two papers with the highest percent
of their bibliographies in the T.I.P. file (1 and 2) and the paper with
the most references to the T.I.P. file (7).


Articles in Proc.      Total        Citations to    Percent of Bibliography
IEEE Vol. 53           Citations    T.I.P. file     in T.I.P. file

 1. pp. 1495-1507         22            10               46
 2. pp. 1452-1464         38            16               42
 3. pp. 1517-1533         58            22               38
 4. pp. 1438-1451         86            32               37
 5. pp. 1508-1517         47            17               36
 6. pp. 1320-1336         33            11               33
 7. pp. 1586-1603        128            36               28
 8. pp. 1604-1623         67            18               27
 9. pp. 1387-1399         56            13               23
10. pp. 1547-1573        101            15               15

Fig. 9.11. Articles in the October 1965 issue of the IEEE
Proceedings that have 10 or more references to
the T.I.P. file.
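
The selection rule for the three evaluating papers can be checked against the figures of Fig. 9.11 with a short Python sketch. The variable and function names are illustrative assumptions; the numbers are the article number, total citations, and citations to the T.I.P. file from the figure.

```python
# (article number, total citations, citations to the T.I.P. file), from Fig. 9.11
ROWS = [
    (1, 22, 10), (2, 38, 16), (3, 58, 22), (4, 86, 32), (5, 47, 17),
    (6, 33, 11), (7, 128, 36), (8, 67, 18), (9, 56, 13), (10, 101, 15),
]


def percent_in_file(total: int, in_file: int) -> float:
    """Percent of a bibliography that is found in the T.I.P. file."""
    return 100.0 * in_file / total


# The two papers with the highest percent of their bibliographies in the file.
by_percent = sorted(ROWS, key=lambda r: percent_in_file(r[1], r[2]), reverse=True)
top_two = [row[0] for row in by_percent[:2]]

# The paper with the most references to the file.
most_refs = max(ROWS, key=lambda r: r[2])[0]

print(top_two, most_refs)
```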


9.31 Bibliography 1 (IEEE Proc., v. 53, p. 1495)

From Fig. 9.11 we note that the article beginning on page 1495
has 22 citations, 10 of which are to articles in the T.I.P. file.

Fig. 9.12 lists the 10 articles as set B and also lists some other
sets of papers that will be found useful in the discussion that
follows. The ith document in set B will be referred to as bi, etc.

The answer clusters obtained by the procedure for 18 different
requests are tabulated in Fig. 9.13. The symbol A[Y(bi)Z(bj)] stands
for the answer cluster with bi specified as interesting and bj
specified as not interesting (i.e. Y=(bi), Z=(bj)).


B


I- 136-1*1*2 
*1-35-159 

II- 35-167 
11-35-1022 
11-36-108 
11-36-12143 
11-36-1267 
11-36-1579 
Ul-12-583 
61*  6-5-33 


D 


11-36-12145 

11-36-31402 


E 


11-36-31*53 

6146-5-176 


f 


11-36-21426 

11-36-3599 

141-12-325 

6L6-6-18 


Q 

i-ijo-6i*7 

11-35-836 

11-35-993 

11-36-661 

11-36-181*5 


H 


1-129-991 
1-130-1439 
l-13l*-172 
1-13-U  -1*07 
1-136-1657 

I- 137-182 

II- 314-1629 
11-314-2639 
11-36-2387 
11-36-3102 
I4I-II-69 
I4I-H-69 
I*l-li4-25l4 
149-14-129 
310-7-1892 
11*6-2-38 
669-16-1*10 
669-18-235 
790-8 -591* 


Fig. 9.12. The sets of articles included in the
clusters for Bibliography 1.


Answers  to  Selected  Requests: 
A[Y(b1)]»A1  for  i-2. ..5,7,8,10 
A[Y(t>1)  J“A^ 

A[Y(b6)]«A2 

A[Y(b9)]«A3 


Definitions  of  Clusters: 


AtY<b9),A(i*ii)]-A;L 

AfY^bZ^)^ 

AtYfb^)]^ 

A[Y(b.b  )J«A JJF  plus  5  members  of  H 
1  and  50  other  articles 
AtY(b2...b10)J-A2 

A[Y(b1...b10)]-A2gAj 


A^“(b2 . .  .bj,b^,bg,b^0)U0UE  A^b^UEUH 

A^UCb^U?  V(bl)UQ 


Fig. 9.13. List of the answer clusters formed for Bibliography 1.


In Fig. 9.14 the probable answers for requests consisting of other
combinations of b's are suggested. All of the requests listed in this
figure have not been actually tested, but experience with the clustering
procedure and the results of Fig. 9.13 make it appear reasonably safe
to assume that the conclusions are correct.

A[Y(bibJ)]-A1  for  i,J-2 .  ..5,7. .  .10  (i/j) 

A[Y(b6b1)]-A2  for  1-2... 10 

AtYtb^b^)]-  (large  set  of  70-100  articles)  for  1-2...10 
A(Y(b9)Z(hi)]*A1  for  1-1...18 

A[Y(Any  combination  of  b,, . .  .b^,b^. .  .b^J-A^ 

A[Y(b4  plu8  any  combination  of  b„. .  .b,  _ )  ] «A^, 

0  c  XU  c 

A[Y(b^  plus  any  combination  of  other  b's)»(large  set  of  70-100  articles) 
Fig. 9.14. Generalizations suggested by the results of Fig. 9.13.


A diagram showing the amount of overlap of the various answer
clusters is shown in Fig. 9.15.


Fig.  9.15.  Sketch  showing  the  relationship  of  the 
answer  clusters  of  Bibliography  1. 


Some comments will now be made concerning the results given in
Figs. 9.12 - 9.15. When the request consists of a single member of
the bibliography, the same answer results in 7 out of 10 cases. This
cluster, A1, contains 8 of the 10 articles in the bibliography (b1 and
b6 are omitted).

The article b9 is included in A1 but does not result in A1 when
used as a request. It results in an almost completely different set of
documents (A3) which contains only one member of the bibliography. The
request Y(b9) is, therefore, ambiguous with either A1 or A3 being a
valid answer. To resolve the ambiguity various documents from the set
H were placed in the non-pertinent set Z. This shifted the answer from
A3 to A1. It was found that the ambiguity could also be resolved by
placing an additional document in the Y set. Thus a request of Y(b2b9)
also resulted in the answer A1.

The cluster A2 exemplifies another type of ambiguity. The set A1
is a subset of the set A2, and thus the requests Y(bi) where
i=2...5,7,8,10, could be satisfied by either A1 or A2. The request
Y(b6) can only be satisfied by A2, however, since b6 is not included in
A1. Thus the article b6 is slightly "beyond" the cluster A1 and if used
in the Y set of the request results in the more general cluster A2 of 17
documents instead of the cluster A1 of 12 documents. Note that both
requests of the form Y(b6bi) with i=2...10 and the larger request
Y(b2...b10) result in the cluster A2.
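
The subset relationship behind this second type of ambiguity can be illustrated with ordinary set operations. The following Python sketch is an assumption-laden illustration: the composition of the two clusters is as read from Figs. 9.12 and 9.13 (eight bibliography articles plus sets D and E for the smaller cluster, with b6 and set F added for the larger one), and the member names of D, E, and F are placeholders.

```python
from typing import AbstractSet


def satisfies(cluster: AbstractSet[str], y: AbstractSet[str],
              z: AbstractSet[str] = frozenset()) -> bool:
    """A cluster satisfies a request if it holds all of Y and none of Z."""
    return set(y) <= cluster and not (set(z) & cluster)


# Bibliography members of the 12-document cluster (b1 and b6 are omitted).
B_MEMBERS = {"b2", "b3", "b4", "b5", "b7", "b8", "b9", "b10"}
D = {"d1", "d2"}              # placeholder names for the two articles of set D
E = {"e1", "e2"}              # placeholder names for the two articles of set E
F = {"f1", "f2", "f3", "f4"}  # placeholder names for the four articles of set F

A1 = frozenset(B_MEMBERS | D | E)
A2 = frozenset(A1 | {"b6"} | F)

print(len(A1), len(A2), A1 < A2)
print(satisfies(A1, {"b2"}), satisfies(A2, {"b2"}))  # either cluster answers Y(b2)
print(satisfies(A1, {"b6"}), satisfies(A2, {"b6"}))  # only the larger answers Y(b6)
```

The sizes 12 and 17 produced by this composition agree with the document counts quoted in the text.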

The only article from Bibliography 1 which is not included in A2
is b1. The request Y(b1) results in the cluster A4 which is disjoint
from any of the clusters discussed so far. When requests of the form
Y(b1bi), i=2...10, are used, very large clusters result including most
of the documents listed in Fig. 9.12 and many more. A check of the
paper from which Bibliography 1 was taken reveals that b1 is cited
only as a source for the values of some constants. It is suggested
that this may be the reason it does not fit into the closely-related
cluster A2 which includes the other nine papers.

One final observation will be made. There are four articles in
A1 and nine in A2 that are not part of the original bibliography.
The question of whether these papers constitute valid additions to the
bibliography will be discussed in Chapter X. Let us at this point,
however, present the titles of the papers in A1 (Fig. 9.16) as an
illustration of the type of additional articles included in the
clusters.

9.32 Bibliography 2 (IEEE Proc., v. 53, p. 1452)

In Figs. 9.17 - 9.20 we present the same data for Bibliography 2
that were given for Bibliography 1. Here again a large majority of
the documents (11 of 16) in the bibliography lead to the same cluster
(A1) when specified as interesting in the request.

From Fig. 9.20 we observe that clusters A1,...,A4 form a hierarchal
series of increasingly larger sets with each new set including the
previous set. The set A4 contains 14 of 16 members of the bibliography
and 50 other documents. The set A^ is the only set in the series that
has 0 bias. The series can, of course, be extended to sets which are
larger than A4 or to subsets of A1 by additional changes in the bias.

There are two members of the bibliography (b6 and b13) that do not
fit into the pattern set by the other 14 members. The article b6 has
no positive connection to any other paper (i.e. none of the papers it
Print the titles of the articles related to J Appl Phys v. 35 p. 159.
12 documents in set 1.

Journal of Applied Physics, Volume 35, page 159

Generation of spin waves in nonuniform magnetic fields I.
Conversion of electromagnetic power into spin-wave power and
vice versa.

Page 167

Generation of spin waves in nonuniform magnetic fields II.
Calculation of coupling strength

Page 1022

Magneto-elastic waves in yttrium iron garnet
Volume 36, page 118

Magneto-elastic waves in yttrium iron garnet
*Page 1245

Electronically variable delay of microwave pulses in
single-crystal YIG rods

Page 1267

Microwave magneto-elastic resonances in a nonuniform magnetic
field

Page 1579

Demagnetizing field in nonellipsoidal bodies
*Page 3402

Anisotropic spin-wave propagation in ferrites
*Page 3453

Propagation of magnetostatic spin waves at microwave
frequencies in a normally-magnetized disc

Physical Review Letters, Volume 12, page 583

Dispersion of long-wavelength spin waves from pulse-echo
experiments

Applied Physics Letters, Volume 5, page 33

Propagation, dispersion, and attenuation of backward-traveling
magneto-elastic waves in YIG

*Page 176

•tall effects in single-crystal spheres of yttrium iron garnet

(no) 

aud.  9*6  sec.  used. 

Fig. 9.16. Titles of articles in the A1 cluster.

(The four * articles were not part of the
original bibliography.)
B

D 

D  (Con' t. ) 

£ 

H  (Con't.) 

l-13li-1302 

1-129-1009 

L9-U-I9L 

ld-ll-706 

1-135-51 

1-135-1761 

1-130-910 

L9-13-285 

310-6-2233 

1-135-16:2 

1-136-772 

1-131-1087 

L9-17-1JU 

1-137-801 

1-136-1731 

1-131-2512 

80-19-671* 

r 

1-137-1305 

1-138-1721 

1-132-522 

80-20-1131 

00 jl 

1-138-5  iU 

11-35-125 

1-132-679 

80-30-ll(2l( 

G 

1-133-1559 

11-36-528 

l-13l(-50? 

80-20-161(7 

r-iB-w 

1-139-539 

Ul-11-2146 

1-135-1388 

80-20-191(0 

1(1-12-21(1 

1-. 1*0-2110 

1*1-12  -l>7 

1-137-311 

80-20-216C 

(9-19-268 

*.-lli2-126 

1(1-12-555 

1-138-1250 

310-5-1318 

310-6-21(73 

3-82-1*01 

1(1-13-1(31( 

1-139-19^9 

310-7-688 

61(6-7-1(5 

3-86-709 

l(i-ll(-372 

3-81-130 

3814-32-100 

61(6-7-62 

11-36-22 

6L6-I4-82 

11-35-137 

612 -3-ltiid 

11-36-3281 

61(6-1-190 

11-35-11(83 

612-3-698 

12-39-11*93 

61(6-1(-212 

11-36-3728 

669-16-383 

21-30-17 i7 

11(6-6-81 

23 -31-1700 
29-30-119 

669-I6-I612 

669-19-2L2 

i-jUi-yp 

I-131-JUU69 

21-30-1617 

1*1-11-11* 

29-31-957 

669 -19 -11(07 

1(1-11-11(6 

Ll-13-308 
1(3-3  7-515 
1(9-1* -1*5 

669-12-1113 

821-2-11(9 

1-131*  -728 

1-131**1313 

l-131(-lii29 

80-20-363 

u*9-21-103i* 

821-2-11*1 

Fig. 9.17. The sets of articles included in the clusters
for Bibliography 2.


Answers  to  Selected  Requests: 
AlY(b  )!-A  i-1,2,3,5,1,^9, 

11,12,114,10 

A[Y(b10)]-A2 
A [Y(ta)]-A3 
A[  Y(k>1g )  ] -A^ 

A(Y(b6)]-(t6) 

A(Y(b13)]-A5 

AtYtb^)]^ 

Definitions  of  Clusters: 

V(blb2b3bSb7b8b9bllVil*bl«>> 

B2  *B1 ^  b  i  0 

VW 

YBjU*l5 


A[Y(bl5b.6,]‘Au 

A[Y(bub^)!.Au 

A(Y(fcl'>lj)]«ALyb13y(29  others) 

A[Y(bU1...b5b7...b12tlli...b,6)).Ali 

A[Y(blli)2(d22)).A? 

A[Y(blli)Z(Cj).A;;n(h9hnh1,h.9h^b;) 

A[Y(bu)Z(bibiJ)).(b3b9butiu)U 

A^UO 
a2^2Udue 
Aj'BjU  DUK  UF 

V^^UO)0 

A5*(t,3b13°iJ^H 


Fig. 9.18. List of the answer clusters formed for
Bibliography 2.
A[Y(bibj)]«A1

A[T(b,0bi)]«A2 

AtY(blib1)]-A3 

A[Y(bl5bi)]-Au 
AtYO^b^)]-  Inconsistent 

AtY^13bi)]“Al:Ubi3  o^ers) 
AtKXjH^ 

All'b^))-^ 

AiY(bliX2)]-A3 

A[-'bl5X3)]'\ 


for  bi,bJCB1 
for 

for  b.CB2 
for  t>jCB3 

(b^  Is  not  linked  to  any  other  paper.) 

for  b£Z  B3 

for  XJCB1 

for  X1CB1 

for  X2CB2 

for  X,  B. 


Fig. 9.19. Generalizations suggested by the results of Fig. 9.18.


Fig. 9.20. Relationship of answer clusters of Bibliography 2.
cites are cited by other papers) and is thus isolated from the rest of
the file. Article b13 can be included in a cluster with the rest of
the papers if the bias is made large enough. The cluster A[Y(b1b13)]
contains, for example, all of the bibliography except b6.

There is one significant characteristic that the five papers not
included in A1 have. They all have relatively few citations. Two of
them have only two citations each, two have only three, and one has
seven. In contrast the bibliography articles in A1 all have seven or
more citations except two, which have five each. It is suggested that
perhaps the reason two of the five are not included in the cluster is
that they have insufficient references to position them properly in the
network.

9.33 Bibliography 3 (IEEE Proc., v. 53, p. 1586)

In Figs. 9.21 to 9.24 the data for Bibliography 3 are presented.

The paper from which this bibliography is taken has four sections
(I, II, III, IV) with section III having four subsections (III A, B, C, D).
The particular section (and subsection) in which each bibliographic
item is first cited is noted in Fig. 9.21. These section numbers are
also noted over the symbols for the documents in Fig. 9.23. Some of
the documents in Fig. 9.23 are enclosed in parentheses. This is to
indicate that the document has already appeared elsewhere in the
diagram.

From  Fig.  9.23  we  note  that  a  hierarchal  series  of  clusters  (A^  to 
A^)  similar  to  the  one  in  Fig.  9.20  is  formed  by  13  of  the  documents 
of  Sec.  III.  A  similar  but  separate  series  (A^  to  Ag )  is  formed  by  the 
documents  of  Sec.  IV.  There  also  appears  to  be  a  separation  of  the 
B

E 

K 

M  (Con't.) 

1-129-12 

IIIA 

1-129-1990 

1-131-73 

80-18-1569 

1-129-18 

me 

1-131-2512 

1-132  -621 

669-16-11*81 

1-129-652 

IIIA 

1-133-1589 

1-I3i*-1 

669-17-87 

1-131-111 

IIIA 

1-136-507 

1-135-19 

669-18-51 

1-131-653 

IIIA 

1-136-1170 

1-136-306 

669-18-896 

1-131-1697 

IV 

1-137-1717 

1-136-203 

669-20-267 

1-131-2U20 

HID 

1-138-88 

1-136-893 

669-20-560 

1-132-1062 

I V 

1-136-11*53 

1-136-11*71 

669-20-583 

1-132-1073 

IV 

1-139-181*9 

1-138-1661 

669-21-75 

1-132-2039 

IV 

1*1-12-357 

1-139-71*6 

1-133-1187 

IV 

310-7-383 

1-11*0-1902 

1-135-71*0 

IIIA 

669-17-628 

1-11*1-1*52 

N 

1-135-1161 

1-136-1096 

1-137-211 

IV 

hid 

me 

T 

1-11*3-229 

1*1-15-862 

669-16-91*5 

1-131-21*33 

1-131-21*63 

1-132-1991 

1-136-993 

1-137-1*31 

1*1-12-553 

80-20-1136 

1-137-889 

1-137-1600 

1-138-687 

21-29-357 

lil-ll-3l6 

me 

me 

Hie 

IV 

HID 

669-18-1125 

669-19-159 

p 

669-18-831* 

669-21-701* 

B 

61-12-106 

lil-12-l66 

me 

me 

U 

1-138-1191 

669-13-1260 

p 

1*1-12-360 

IIIE 

669-16-156 

1*1-13-162 

me 

669-18-1*19 

U 

1-133-110U 

69-7-112 

IIID 

ft 

1-139-1876 

1*9-8-155 

HIA 

1-129-1088 

1-11*3-652 

1*9-8-160 

IV 

H 

1-130-92 

69-13-282 

1*9-12-297 

me 

1-133-81* 

1-136-22 

1*1-11-552 

J 

1*9-5-233 

1-130-565 

1*9-13-287 

1*9-11**13 

l*9-H*-73 

1*9-17-181* 

61*6-6-111 

669-17-50 

669-18-1*03 

me 

IIIA 

me 

me 

IV 

IIIA 

me 

1-131-617 

1-131-1995 

1-131-2073 

1-132-1512 

1-133-1*1*3 

1-133-151*6 

1-135-1698 

Q 

1-129-2055 

1-132-1885 

1-160-187 

1-160-1629 

1-161-592 

69-7-7 

69-12-297 

80-20-1376 

310-6-2565 

669-16-818 

669-16-1659 

669-20-552 

HIA 

1*9-7-133 

1-137-1172 

D 

80-20-11*21 

1-137-1706 
1-139-32 3 

1-130-929 

1-132-522 

1-132-535 

1-139-11*59 

l-li*0-205l 

1-11*0-2065 

1-11*1-1*52 

1-11*1-553 

1-11*3-1*06 

1-135-181 

1-137-883 

1-11*0-1355 

669-18-908 

Fig. 9.21. The sets of articles included in the clusters
for Bibliography 3.
Answers to Selected Requests:
AlY^)]-^  1-1,2,20,23,36 

A[Y(b35)]-A3 

A[Y(b5)]«Ali 

AlYfb^-Aj  1-15... 17,22, 2li, 

28,29,32 

AtY(to1)]-A6  1-8... 11,13,27 

A[Y(b6)]-A? 

A[Y(bi)]-Ag  1-18,19 
A[Y(bi)]-A^  1-U,3lt 

AtY(b?)]-A10 
AtY(b3Q)]-Au 


AtYjbj^)]-  Misc.  large  sets  of 

documents  (88-159  articles) 
1-3,12,25,26,31,33 

AlT(blBb21)1“A5 

A[Y(b2b22b2|ib3j)  3-tA^y  Aj  U  ^b?b35f2  ^  ^ 

nit-) 

A[Y(bjb2^)]«(cluster  of  108) 
l[,(b10b18b!9’,3>)1-A12 


Definitions  of  Clusters: 
Alc(blb2bljb2  3b3hb36bl6bl8b2o\j 

dUe 

A2-AiU(b7blh)UF 

VA3U(b5)U« 

A5'(bl5bl6bl7bl8b20b21b22b2ii 

b28b29b32)UDUjLK«ihi) 

A6"^b8b9b10b llb13b27  ^  U  K  U 
(hlh2e5e8) 


VMKsV 

V(V5bllb3lb36>U« 

Aio=A9U(b7)UfIU(e7) 

All'(blb5b7b30)UpU 

(d6ele6e8blh2ml5a17q6) 

Al2'A3UA5U(“i2q7) 


Fig.  9.22.  List  of  answer  clusters  formed  for  Bibliography  3. 
documents by subsection within Sec. III. Note that 10 of the 13
documents cited in subsection IIIC are included in cluster Aj.

The structure of the clusters in this example was found to be
considerably more complex than in the previous two examples and no
attempt is made to predict the results of requests that have not been
explicitly tested. One can gain some appreciation of the complexity of
the interrelationships between the clusters by an examination of
clusters A^ to A^.

As with Bibliographies 1 and 2 there are a few of the documents
that are not included in the clusters of Fig. 9.23. Nine articles are
cited by Sec. IV. All of these except b^ are included in the cluster
Ag. Thirteen articles are cited by Sec. IIIC. All of them but b*2, b31,
and bg-j are in A^ and all but b^ are in A^. The cluster A^ is more
general in that it includes not only articles cited by Sec. IIIC but
also those cited by Secs. IIIA, D and E. Of the 27 articles cited by
Sec. III, 20 are included in A^. The seven missing articles are b^, bj,
b12, b25, b26, b30, and b31.

The article b^ was examined in detail in an attempt to discover
why it was not included in A^. It was found to have six references.
Of the six, one was keypunched incorrectly. Two of them are to articles
in a Russian journal (Soviet Physics - JETP), whereas the other
references to these articles in the T.I.P. file are to the journal in
which the English translation is found. A fourth reference is to a paper
written by the same author and not cited by anyone else, and a fifth is
to a bulletin, which was evidently not sufficient to cause it to be
included in A^g. It was found that if the references had been correctly
keypunched and had been to the correct English translations, b^ would
have been included in and probably A^.

There  is  one  other  feature  of  the  article  from  which  Bibliography  3 
was  taken.  In  the  final  paragraph  the  author  made  this  comment. 

"I  wish  to  thank  ...A.  R.  Mackintosh  for  calling  B.  I. 

Miller's  work  to  my  attention.” 

The  article  by  B.  I.  Miller  was  checked  to  see  if  it  would  have 
been  included  in  any  of  the  clusters  if  it  had  been  part  of  the  T.I.P. 
file.  It  was  found  to  have  only  one  reference  but  this  reference  was 
sufficient  to  cause  it  to  be  included  in  A^.  Thus  this  procedure 
could  have  performed  the  same  reference  service  that  A.  R.  Mackintosh 
did. 


9.4 Comparison to Categories

In  the  last  section  we  compared  clusters  to  the  bibliographies 
compiled  by  the  authors  of  three  articles.  Another  source  of  sets  of 
articles  that  have  been  judged  to  be  related  would  be  the  subject  index 
found  in  one  of  the  journals  or  in  Physics  Abstracts .  For  this  purpose 
one  category  was  selected  from  the  subject  index  of  Physical  Review  and 
one  category  was  selected  from  Physics  Abstracts. 

9.41 Physical Review Category

Most of the categories in the Physical Review Subject Index are
very broad. The sets formed by clusters, on the other hand, are in
general much smaller and much more specific. Of course, larger clusters
could be formed by including a large number of articles in the Y set of
the request, but they would require a large amount of effort to process
and compare. For this reason a category with relatively few entries was
selected. Its title changed periodically over the three year period,
but it was identified as the one which was referred to when one looked
up the word "luminescence" in the word list which was supplied with
the subject index. The various titles used for the category are as
follows:


1963            Luminescence                       (18 articles)
1964   46.4     Luminescence and Fluorescence      (6 articles)
1965   42.3     Optical Emission and Absorption    (17 articles)
1966   Ui.3     Optical Emission and Absorption    (2 articles)

The same format used for presenting the data in Sec. 9.3 is used
here in Figs. 9.24-9.26.

It will be seen from Fig. 9.26 that most of the papers separate
into the three major areas represented by A^, A^, and A^. A statistical
analysis of the composition of each of these three clusters is given
in Fig. 9.27. It is found that the only words that appear more than
once in the titles of two or more of the clusters are optical,
absorption, radiation, and crystals. The correspondence of these words
to the title of the original category (optical absorption and emission)
is of interest.

A similar analysis of the author lists showed that N. Bloembergen
was the only author that appeared more than once in two or more of the
lists. The citation lists were also found to have very little overlap.
The greatest overlap occurred between A^ and A^. For example, the 1st,
3rd, 5th, and 7th entries in the list for one of these clusters were
found in the list for the other with a count of 2.

It is thus concluded that the articles in the clusters A^, A^,
and A^ do have different characteristics. Whether the distinction
B
1-129-169    1-129-593    1-129-2422   1-130-502    1-130-639
1-130-945    1-130-2257   1-131-127    1-131-501    1-131-508
1-131-1114   1-131-1456   1-131-1543   1-131-2036   1-132-224
1-132-1023   1-132-1482   1-132-2501   1-133-1163   1-136-141
1-136-271    1-136-508    1-136-541    1-136-1091   1-137-508
1-137-536    1-137-1117   1-137-1651   1-137-1787   1-138-63
1-138-180    1-138-806    1-138-1741   1-139-321    1-139-544
1-139-1239   1-139-1616   1-140-155    1-140-263    1-140-601
1-140-1867   1-143-372    1-143-574

D
1-134-1166   1-137-801    1-138-1      1-138-960    3-82-393
3-85-565     3-86-709     41-12-504    41-13-334    41-13-657
41-13-720    49-10-52     49-11-294    646-6-25

E
1-139-10     1-140-1051   1-141-287    1-141-306    41-14-68
199-138-753  199-139-202

F
1-129-125    1-132-2023   1-137-1515   1-138-1472   1-138-1477
1-139-1262   1-139-1991   1-140-352    41-14-64     49-19-89

G
1-139-588    1-140-576

H
1-129-1980   1-132-2450

J
1-131-1912   1-132-1029   1-135-950    1-135-1622   1-137-1087
1-138-1287   1-139-314    11-34-1682   11-35-1183   12-38-1544
12-38-1607   12-38-2289   12-39-3118   12-42-1999   49-18-219
49-19-98     80-18-1443   80-19-1096

X
1-133-1029   1-136-481    12-42-3404

R
1-133-163    1-133-1717   1-134-299    1-134-423    1-135-1676
1-137-583    1-137-1016   1-138-276    1-139-1687   1-139-1965
1-140-880    80-19-2260   669-21-204

M
80-19-924

N
1-140-957    49-5-186     612-4-264

P
1-139-970


Fig. 9.24. The sets of articles included in the clusters
for Category 1.




A[Y(b3U)]-A3 

i-33,37,38 


A[Y(b28)]-A$ 

A[Y(b30)]-A6 

A[Y(b1)]-A7  1-8,19 

A[Y(b  )]-Ag 

A[Y(b^)]-A9 

A[Y(b39)]-A1Q 

A[Y(b2)]-Au 


A[Y(b1?)]»Al2 


AtYfb^l-Ay  1*5,12,27 

AtYfb^H^ 


A[Y(b31)]«Al5 

A[Y(bUo)]-Ai6 

AtYtbj^JJ-A^ 

AtYt^)]-^  i-7,22,2U 


AtYtb^l-A^  i-10,11 

AtY<bi)]-A20  1-13,13,20 

A[Y(b25)]-An 

Al.Y(b35)]-A22 

A[Y(bi)]-A23  1-1,6 

AtY(bl5)]-A2li 

A[Y(b1)]-(bi)  1-3, 9, 1*1 

A[Y(bi)]-(large  clusters)  1-23,32,36 

A[Y(bib2bi2)]-(l°7  articles) 

AtY(b28b3li)]-A3UA5-A25 

A[Y(b2gb30b3lj)]-(10U  articles) 

A[Y(b35bii2)]-(large) 

A[Y(bgbi?)]-(1arge) 

A[Y(b2b39)]-(large) 

A[Y(bpQbJin)]-(large) 

A[Y(b27b3Ao)]‘(Al5UA17Ub6^(r2V8blb7) 
A[Y(bi8b2l4b27)]-Ai5gA17  UA18UA20 

U(b6elplf6)’A26 


Definitions  of  Clusters: 


Ar(V33bl*2)UD 

A2«AlU(b26bli3) 

VA9b33b37b38^UE 

A5-AliLKb28) 

V(b30dl} 

A7“Ab19^F^G 

Ag-A^  U  (t>16h2 ) 
a9“A3  U 
A10*  b39^2^ 

All“(b2glg2) 

A12“^b17flgl^ 

A13c^b5b12b27  *  U(r3r9rior12 


)UJ 


Alli”Al3  ^ 

Al6“^blb7b27bUO^^ri*  *  ,r8r9rll 

Al7“(blb7b27buO)UR 

Al8"^b7t22b2i;mlr2r9^ 

A19'('b10blimi^ 

A20“^bl3bl8b20Jl3®l^li 

A2l“(b25k2) 

A22*(b25b35Pl) 

A23’(V6) 

A2h^bl5r7rll) 

a25-a3Ua5 

A26*A15^A17^A18UA20  (b68lplf6) 


Fig.  9.25.  Answers  to  selected  requests  for  Category  1. 
CLUSTER  A2?

CLUSTER  A9 

CLUSTER  A^ 

(30  articles ) 

(18  articles) 

f55  articles) 

109  word*: 

86. 

words : 

216  words: 

13 

raman 

7 

SIC 

12 

ruby 

9 

stimulated 

6 

Exciton 

11 

optical 

6 

laser 

5 

Complexes 

9 

lines 

6 

radiation 

6 

Absorption 

8 

KCL 

6 

scattering 

6 

Luminescence 

8 

spectra 

5 

theory 

3 

CdS 

7 

cryatals 

1* 

fluctuations 

3 

Effects 

6 

absorption 

6 

intensity 

3 

Emission 

6 

thermoluminescence 

3 

effects 

3 

Nitrogen 

5 

excited 

3 

emission 

3 

Optical 

5 

r 

3 

liquids 

3 

Radiation 

5 

MgO 

3 

media 

3 

Recombination 

6 

center 

3 

optical 

n 

Cadmium 

6 

Crt 

3 

order 

0 

0 

6 

irradiated 

3 

waves 

0 

6 

R 

2 

anti 

6 

relaxation 

• 

• 

• 

3 

alkali 

37 

authors : 

25  authors: 

85 

authors: 

T" 

Shen  ¥.  ft. 

6 

Choycke  V.  J. 

T 

Sturge  M.  D. 

a 

Bloembergen  H. 

6 

Hamilton  0.  R. 

5 

McCumber  D.  E. 

2 

Armstrong  J.  A. 

2 

Patrick  Lyle 

3 

Blooanergen  N. 

2 

London  R. 

2 

Dean  P.  J. 

3 

Schawlow  A.  L. 

2 

Smith  Archibald  W. 

2 

Reynolds  D.  C. 

3 

Ten  W.  M. 

2 

Tang  C.  1. 

1 

Anders  W.  A. 

2 

Arten  J.  0. 

1 

• 

Anderson  B.  0. 

0 

• 

e 

• 

• 

• 

2°2 

citations: 

268  citations: 

866  citations: 

12~ 

1-127-1915 

IT 

61-6-}6l 

22 

“8o-13^88o 

10 

1-130-2529 

11 

1-128-2135 

15 

1-122-381 

10 

1-131-2766 

11 

61-1-650 

15 

12-36-2757 

10 

1-133-37 

10 

1-127-1868 

16 

11-36-1682 

10 

61-9-655 

8 

1-131-127 

13 

1-122-1669 

10 

6l-ll-l6o 

7 

1-116-67  3 

10 

1-130-6*9 

10 

1:9-7-186 

6 

1-133-1163 

10 

12-20-1752 

9 

666-3-181 

5 

1-120-1666 

0 

80-1S-399 

8 

61-11-619 

5 

1-127-1878 

8 

1-57-62- 

8 

Ul-12-501* 

5 

1-132-20 23 

8 

30-31-95o 

7 

1-13U -lli29 

6 

(5  citations) 

7 

(j  citations) 

7 

666-3-137 

3 

(7  citations) 

0 

(12  citations) 

6 

61-12-290 

2 

(62  citations) 

5 

(8  citations) 

5 

(5  citations) 

1 

(186  citations) 

6 

(18  citations) 

a 

(ll  citations) 

• 

• 

3 

(33  citations) 

3 

(17  citations) 

* 

2 

(121  citations) 

2 

Hit  citations) 

1 

(761  citations) 

1 

(212  citations) 

Fig. 9.27. Comparison of the three clusters formed for Category 1.
between the clusters is of practical significance to a user would, of
course, require further experimental justification.

As an additional comparison the results of this section were
compared with the articles found in the category in Physics Abstracts
with the title "luminescence." This category contained 22 of the
articles listed in Fig. 9.24 (14 in set B and 8 others). All of these
22 articles were included in A^ or A^. This would tend to indicate that
the Physics Abstracts indexers considered the articles of A^ to be in


9.42  Physics Abstracts Category

Since a property (luminescence) was chosen for the last section,
it was decided that a category covering a substance might be appropriate
for this test. We again sought a category with relatively few entries
so that it would be easier to compare it with the related clusters.

The category with the heading "Erbium" was selected. The articles
classified in this category from January 1963 to the present are listed
in set B of Fig. 9.28. Figs. 9.29 and 9.30 present the related
clusters.


In the last two sections we compared the results of the clustering
procedure to the three bibliographies and two categories. In this
section we will present the response of the system to some actual
requests for information. The responses to both a relatively simple
request and a more complex request are studied.




B
1-131-1043   1-131-1586   1-132-1609   1-137-138    1-137-1109
11-35-1047   11-36-1001   11-36-1127   11-36-1249   12-38-2190
12-39-1285   12-39-1629   12-39-2128   12-40-2751   12-40-3606
12-41-1225   12-41-3363   12-42-873    12-43-847    29-29-477
49-8-5       49-11-100    49-13-112    49-15-301    49-16-265
49-17-95     80-20-808    80-20-1332   199-137-790  310-6-2225

D
1-129-2072   1-130-1337   1-130-1825   1-131-932    1-131-1039
1-138-216    1-139-1606   1-140-1896   3-81-846     3-84-63
3-84-693     11-36-906    11-36-1078   11-36-3628   12-39-1449
29-31-1      49-6-19

F
1-131-158    1-134-1620   1-137-1139   1-138-241    3-85-955
11-36-1209   49-17-96

r
1-132-542    1-133-219    1-134-94     11-35-800    12-43-2087

G
1-129-1601   1-130-1100   1-133-1571   1-134-320    1-134-1492
1-136-175    1-136-231    1-136-271    1-136-711    1-136-717
1-136-726    1-137-627    1-137-1449   1-140-1968   1-141-352
1-141-461    3-81-663     12-39-U*22   12-39-1455   12-39-3503
12-42-377    12-42-981    12-42-1423   21-29-948    21-31-845
21-31-1325   49-10-16     49-10-496    310-7-1150

H
1-139-241    3-82-874     12-38-2750   12-42-4000   12-43-1680
80-18-1636

J
1-130-2325   1-132-280    1-133-881    1-136-1433   1-140-2005
1-142-115    12-41-565    12  -l*i-6ri 41-11-196

K
l-llll-l*    U3 -36-505   1-137-1886   1-139-2008   3-84-297
12-38-976    12-38-2171   12-39-3251   12-40-796    12-40-3428
12-42-162    12-42-993    12-42-3797   12-43-2124   41-11-253

M
1-130-945    1-130-1370   1-133-34     1-133-494    1-134-172
1-134-1504   1-137-1749   1-138-1682   1-141-259    12-39-1024
12-39-1154   12-40-743    12-41-892    12-42-743    164-39-342
310-7-1450

N
I- 138-15U*  12-38-1476   12-38-2190   12-39-2134   12-41-1305
12-41-3227   12-43-1702

P
1-133-1364   49-19-463

Q
12-41-1970

R
11-36-2422   80-20-997

S
1-133-1364

T
21-29-974    49-20-496

U
669-17-1118  669-18-1022

V
1-135-97

W
1-140-1188   1-141-251

X
11-36-984    12-41-892


Fig. 9.28. The sets of articles included in the clusters
for Category 2.
AtKbj)]^

Answers  to  Requests: 

1-1,6,11.20  aEy^JJ-A^ 

AtY^)]^ 

AEYCb^)]^ 

AtY(b28)]-Ai6 

A(X(b17)]-Au 

A[Y(b29)]-A1? 

AtYCb^)]-^ 

AEY(b26)]-Ai8 

AtY(b30)]-A6 

A[Y(b25)]-(b25) 

AtKb^)]^ 

A[Y(bi0)]-Ai9 

ArY(bi5)]-Aa 

A[Y(b13)]-A2Q 

AlKbjg)]^ 

A[Y(b5)]-A2l 

AWbtfJl^M 

AttEb^l-A^ 

A[Y(b23)]-Au 

A[Y(b1)]-A23 

A[Y(bi)]*A12 

i-22,2lt  AtYfb^l-A^ 

A[Y(b9)]-A13 

Definitions  of  Clusters: 

Aiir<VUR 

Al5“(b12°2)US 

Al6“^b28g26mi5^T 
Al7*(b29> 

Al8“(b26)UV 

A19“^b10blUb17g10g19g23g26h2J2Jl4J7 
kUk7k10k13D1IlnlD6n7) 

A20'^b13b17b19g3gi»gli4g17gl8g19g21g22g26 
h2b3hhJ7k3kak5k6kllklaD12nh ] 

A2l“(b5bl6g8J6klu)Uw 

A22"(b2b17b20d5d7eUf3g2'*  ,g6g12“  ,gl5 
gl7gl8g21g23g25g27h2b3lVl’  '  ‘^ll^ 

A23*^b3bll*bl8b21b30^5g5gl5gl8g2?g29 
blJ8Jo^oin2xl3t2_' 

A2h^A23U4Tn(^) 

Fig. 9.29. Answers to selected requests for Category 2.


Al-<blb6bllb20)U° 

Ag-AiUO^JUE 

A3-A2U(b?)UP 

V(b3Vl7)UG^Kel1) 

VauU<VUh 

VA5^(b2b20b30d5d7f3)UJ 

A7‘A6U(Vl9faklk2) 

All”A10  U  ^b2lb23f5  )U  p 

Al2“(b22b2l»dljflJ) 

A13-(b8)UQ 
This test was performed in cooperation with a research physicist
from Lincoln Laboratory. His initial request consisted of the following
relatively brief specification:

words:     turbulence
           subsonic, hypersonic, wake  (perhaps)

authors:   Lees
           Hromas

articles:  none


No articles were found which were written by the two authors
(actually there were three papers by a Lees but in a completely
different area). There were 70 articles that had either "turbulence"
or "turbulent" in their titles (set T of Fig. 9.31). There were 27
which contained one or more of the words "wake", "subsonic", or
"hypersonic" (set W of Fig. 9.31).

At this point a number of the articles in set T were used as
requests to the clustering procedure. The cluster structure shown in
Figs. 9.32 and 9.33 resulted. The physicist was asked to evaluate the
pertinence of each of the articles presented. He gave three types of
responses: pertinent (y), non-pertinent (n), and questionable
pertinence (m). The responses are indicated in Fig. 9.31 and also in
Fig. 9.32 by the superscripts. It will be noted that nine of the twelve
articles specified as pertinent are in the cluster.

The physicist was asked if there was any detectable difference
between the articles in the A^ and A^ clusters, which were kept disjoint
by the procedure. Of the 16 articles in A j, l£ were from Russian
journals, while 27 of the 35 articles in A^ were from American journals.
It was
T (articles t1 to t70 in order, each followed by its pertinence code)
11-36-2075 y    11-36-2201 n    21-31-141 n     29-30-17 y
41-14-813 n     41-14-892 n     41-15-381 n     49-9-144 n
49-12-201 y     49-13-297 m     49-18-2?4 n     80-19-1430 n
384-32-292 n    646-7-285 y     669-16-295 n    669-16-1578 n
669-17-403 m    669-17-1449 n   669-18-847 n    669-18-1251 n
669-18-1268 m   669-19-349 m    669-20-445 n    669-20-1519 n
669-21-744 y    669-21-774 m    669-21-1161 n   790-6-882 n
790-6-1017 m    790-7-344 n     790-8-54 n      790-9-1057 n
790-9-1429 n    790-10-191 n    790-10-1041 n   799-6-1016 m
799-6-1048 m    799-6-1250 n    799-6-1260 n    799-6-1693 m
799-7-190 n     799-7-335 m     799-7-562 m     799-7-629 m
799-7-816 m     799-7-1030 m    799-7-1048 m    799-7-1156 y
799-7-1160 m    799-7-1163 m    799-7-1169 m    799-7-1178 m
799-7-1191 n    799-7-1403 n    799-7-1723 y    799-7-1?35 y
799-M920 n      799-8-391 n     799-8-492 n     799-8-575 m
799-8-598 y     799-8-1063 m    799-8-1509 n    799-8-1647 n
799-8-1659 n    799-8-1775 m    799-8-1792 y    799-8-2219 y
799-8-2225 n    821-2-332 n

W
1-134-581    1-135-1761   1-138-934    3-82-669     11-36-34
41-10-127    41-13-437    41-12-592    41-13-742    41-15-346
49-19-459    80-18-288    80-18-1515   646-4-28     646-7-187
799-6-946    799-6-1388   799-7-197    799-7-667    799-7-1147
799-7-U98    799-8-44     799-8-24     799-8-956    799-8-1428
799-8-1456   799-8-1792

D
11-36-3609 y    17-32-298 n     669-18-698 n    669-18-1014 n
669-19-499 n    669-19-1165 n   669-20-135 n    790-10-605 n
799-6-1603 n


Fig. 9.31. Sets of articles included in the
clusters for Physicist 1.
(y = pertinent, n = non-pertinent,
m = questionable pertinence)
AtT(t1)l-(tu6tU7tU9t5ot55t6o

t62t61*t65t68d9) 


-A.  i«Li6,U7, 5*9,50,55, 

1  60, 62,61*, 65, 68 


Atrtdg)]^ 

At»(*36il"A1U(t36) 

AtY(t52)]-A1U(t36t52) 

AlY(tlt8)]-A,U(t36tU8J 

AtY(t6i)]-A1U  tt3£tSl8t52t6l) 

A[Y(t5l)]-A1y(t36tU8t$2t6lt  5l) 
-A2 


A[Y(t1) ]-(tl9t23t2Ut25t26t27 

d3dhd5d6d7d3^ 

-A6  y  i-19,2U,25,26,27 

A[Y(di)3-A6y  i-3,U,5 

A[Y(t32)]-A6y(t32) 

A[Y(tl7)]-A6y(t32t22ti7) 

a[i(  tg )  ] -A6y  ( t32t22t17t8 ) 

AtY(tl6)3-A6y(t32t22ti7tl6) 

i-37,66 


A[l(tl3)]-(t12tl3t38) 

A[t(t31)]-(t31t3Ut65) 

A[Y(t33)]-(t33t3Qt65) 

A[Y(^>Mtj8V»>  1“38,U3,53 

A[Y(tli(l-(tut68) 

AtKtigiHt^) 

Af.Y(t28)]-(tl6t28) 

AtY(t9u)l-(t5uwl7) 

A[Y(d2)]-(d2t67) 

AtY(t70))*(^5t70^ 
AtYWMd^) 
kW\)]^t2t69)  1-2,69 

AtY^jMt^)  1-3,12 
AMt^Mt^)  1-5,20 
A[Y(t1)]-(t9t23)  1-9,23 

AtY(tt)]-(t21t22)  1-21,22 

A[Y(^t)l-(H3U*39)  1-31**39 

AtYU^Mt^V?)  1-53,56 
AlYd^Ht^) 


U2,Ui,U5,59,63 


Fig. 9.32. Answers to selected requests for Physicist 1.
m  a  .m  m  .y  .m  .m  ,n  .y  ,n
tl>6t^7\9?0^60t62^t6$t68  \ 

J»  ty  ..y 

t36tU8t5lt52t6l 


•n  tn  tn  tn  ty  tn 

^2t13t33t37t38tli3t56t58t66  „ 

ty  ta  *n  tn 
*U  t31t3Ut39 

* 

n  y 
'3  tlU 

tn  +y 
<28*67 

4 

n  n  m  m  n 
tl6  tl7  *22  t32 


,n  .  n  ,  n  .  y  .  ni  .  n  »**  »•*  *•*  «h  <m  %»• 

t19t23t2i1t25  26t2?  d3  %  S  d6  *7  d8 


n  jD  ,n  .n  .n  .n 1 


tn  t? 

fci8  no 


wi  vi h  wi5 


Fig. 9.33. Relationship of answer clusters for Physicist 1.

(y = pertinent, n = non-pertinent, m = questionable pertinence)


initially  thought  that  the  cause  of  the  separation  of  the  two  clusters 
was  probably  due  to  the  fact  that  the  Russians  generally  cited  Russians 
while the Americans cited Americans. After examining the two sets, the
physicist  expressed  the  opinion,  however,  that  Kj  appeared  to  be  more 
concerned  with  the  upper  atmosphere  and  ionosphere. 

Also supporting the contention that there is a valid and useful
distinction between A^ and A^ is the fact that nine of the eleven
articles judged to be pertinent were from one of the two clusters.

Because of the incompletely inverted files and the delays caused
thereby, the actual searches were performed by the author of this
thesis and later discussed with the physicist. It was interesting to
note that at one point in the discussion, he stated that he could have
more correctly shaped the final cluster by being able to specify as
non-pertinent some articles on turbulence in helium that appeared in
one of the clusters.

We note in passing that the physicist who aided in this test is
the author of article V


In this section an example is given of how the clustering procedure
might be used to supplement or extend an already available collection of
papers on a given subject.

A bibliography of 112 articles on Langmuir probes was supplied to
the author by another research physicist at Lincoln Laboratory. Of the
112 articles, 99 are in journals, ... are in the 2? journals covered by
the T.I.P. file, and 21 are actually in the T.I.P. file. The
identifications of the 21 articles in the T.I.P. file are given in
Fig. 9.34.


Fig. 9.35 shows the distribution of the articles in the file with time.
Fig. 9.36 lists the words occurring in five or more of the 112 titles.
In this list words such as "of", "the", "theory", etc., have been omitted.
Also words have been grouped by stem; thus, the words "ion", "ions",
"ionized", etc., are all grouped under the word "ion".
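The title-word tabulation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word list, the stem table, and the three sample titles are hypothetical, and a hand-written stem map stands in for whatever grouping rule was actually used.

```python
# Hypothetical stop words and stem map, for illustration only.
STOP_WORDS = {"of", "the", "theory", "in"}
STEMS = {"ions": "ion", "ionized": "ion", "probes": "probe"}


def title_word_counts(titles):
    """Count, for each stem, the number of titles in which it occurs."""
    counts = {}
    for title in titles:
        # A set is used so that a stem is counted once per title.
        stems = {STEMS.get(word, word) for word in title.lower().split()}
        for stem in stems - STOP_WORDS:
            counts[stem] = counts.get(stem, 0) + 1
    return counts


titles = ["Theory of the Langmuir probe", "Ions in probes", "Ionized gas"]
counts = title_word_counts(titles)
```

Counting titles rather than word occurrences matches the "number of articles" column of Fig. 9.36.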


Set B
3-82-243     11-34-1165   11-34-3209   11-35-1130   11-36-337
11-36-675    11-36-1866   11-36-2363   21-30-182    21-30-193
21-30-3,5    49-11-126    80-18-260    80-18-1908   690-8-720
799-6-11.79  799-6-1492   799-14-U433  799-7-1843   799-8-56
799-8-73


Fig. 9.34. 21 Articles in Langmuir Probe that are in
T.I.P. file.


Fig. 9.35. Publication year distribution of initial
Langmuir Probe bibliography. (Axes: number of articles versus year.)


Words                   Number of articles

probe                   87
plasma                  40
Langmuir                35
ion                     18
gas                     15
discharge               13
electron                12
collection              10
density                  8
low                      7
pressure                 6
spherical                6
electrostatic            6

probe and plasma        32
probe and Langmuir      35
probe and ion           16
probe and gas            7
probe and discharge      6

Fig. 9.36. Title word distribution for the 112 titles of
the initial Langmuir probe bibliography.


As  an  additional  part  of  this  test  it  was  decided  that  five  other 
types  of  search  strategies  would  also  be  used  and  their  results  would 
be  compared  to  the  results  of  clustering.  The  five  search  strategies 
selected  will  now  be  described. 

TITLE  WORD  SEARCH 

One  possible  search  strategy  would  be  to  retrieve  all  those 
articles  which  have  some  word  or  logical  combination  of  words  in  their 
titles. The choice of the word or words to be used was made on the
basis of the frequency of occurrence of the words in the bibliography
(Fig. 9.36) and in the T.I.P. file and with the advice of the physicist.
Several test runs were made with various word combinations. A simple
request for all articles with the word "probe" in their titles was
selected. This retrieved 58 articles including 20 members of the
original bibliography.


AUTHOR SEARCH


There are 114 different authors of the 112 articles in the
bibliography. A search of the T.I.P. file for articles by these 114
authors yielded 120 articles (21 from the original bibliography and 99
other papers). This search was not exhaustive but involved looking for
authors only in those journals where it was thought they might publish.

CITATION SEARCH

The  third  type  of  search  consisted  of  finding  all  of  the  articles 
that  cite  one  or  more  of  the  112  articles  in  the  bibliography.  A 
search of the T.I.P. file using this criterion yielded 78 articles.

BIBLIOGRAPHIC COUPLING SEARCH

When two papers cite one or more of the same papers they are said
to be bibliographically coupled (Sec. 0.22). There are 270 articles
that are bibliographically coupled to one or more of the 21 articles
in set B of Fig. 9.34.

The  coupling  strength  between  two  papers  is  defined  to  be  the 
number  of  identical  citations  that  they  have.  The  coupling  strength 
between  one  paper  and  a  set  of  papers  is  defined  to  be  the  number  of 
citations  in  the  single  paper  which  are  also  found  in  one  or  more  of 
the  papers  in  the  set.  In  Fig.  9.37  we  show  the  distribution  of  the 
270  articles  by  their  coupling  strength  to  the  set  B. 
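The two definitions of coupling strength can be stated compactly in code. The following is a minimal sketch; the function names and the citation identifiers are hypothetical and are not taken from the experimental system.

```python
def coupling_strength(citations_a, citations_b):
    """Number of identical citations shared by two papers."""
    return len(set(citations_a) & set(citations_b))


def coupling_strength_to_set(citations, papers_in_set):
    """Number of citations of one paper that also occur in the
    citation list of one or more papers of the set."""
    pooled = set()
    for refs in papers_in_set:
        pooled |= set(refs)
    return len(set(citations) & pooled)


# Hypothetical citation lists.
paper = ["r1", "r2", "r3", "r4"]
set_b = [["r1", "r9"], ["r1", "r3"], ["r7"]]

pairwise = [coupling_strength(paper, refs) for refs in set_b]
to_set = coupling_strength_to_set(paper, set_b)
```

Note that the strength to a set is not the sum of the pairwise strengths: a citation shared with several members of the set (here "r1") is counted once.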

JOINTLY CITED SEARCH

Bibliographic  coupling  occurs  between  two  papers  if  they  cite 
one  or  more  of  the  same  papers.  Another  type  of  coupling  occurs  if 
two  papers  are  cited  by  one  or  more  of  the  same  papers.  There  are 
605  papers  which  occur  in  one  or  more  bibliographies  with  articles  of 
set B. Of the 605, 101 are in the T.I.P. file.
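The second type of coupling can be sketched as a search over bibliographies: every paper that appears in some bibliography together with a member of the given set is collected. The data layout (a mapping from citing paper to its list of cited papers) and all identifiers are assumptions made for the illustration.

```python
def cited_with(bibliographies, target_set):
    """Papers that occur in one or more bibliographies together with
    at least one member of target_set (members themselves excluded)."""
    target = set(target_set)
    found = set()
    for refs in bibliographies.values():
        refs = set(refs)
        if refs & target:          # this bibliography cites a member of the set
            found |= refs - target
    return found


# Hypothetical bibliographies: citing paper -> cited papers.
bibliographies = {
    "p1": ["b1", "x1", "x2"],
    "p2": ["x3", "x4"],
    "p3": ["b2", "x2", "b1"],
}
jointly_cited = cited_with(bibliographies, ["b1", "b2"])
```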
Fig. 9.37. Distribution of articles with various bibliographic
coupling strengths.


CLUSTERING

The user specified the article b^ as the article of greatest
interest in the bibliography. The articles b^, bg, b^, and b^ were
ranked next in terms of interest. The clusters which resulted when
these and various other articles were used as requests to the system
are shown in Figs. 9.38 - 9.40.


11-31* -189? 

55-3*1-132 

80-19-1915 

612-2-719 

799-7-2  329 

799-8-71*8 

E 

3-83-971 

11-36-3135 

11-36-311*2 

11-37-180 


S  (Con’t.) 

0 

J 

1*1-11-  310 

3-83^7 T~ 

n-35-13# 

1*1-15-286 

11-35-130 

790-10-1102 

6ii6-Ii-l86 

55-1*1-391 

799-6-1762 

y 

55-1*1-11*05 

799-7-1831* 

3-81-662 

790-7-921 

K 

11-36-31*2 

H 

799-7-110 

799-8-920 

799-8-2097 

85TO5T 

11-36-2361 

11-36-3526 

612-3-18 

790-7-7CJ 

80-18-1056 

80-20-81*5 

612-2-58 

ll-3?-377 

Fig. 9.38. The sets of articles included in the clusters
for Langmuir Probe Bibliography (Physicist 2).
Answers to Requests:

A[Y('b1)J-A1  i-1^,16,17 

A[Y(bi)]»A2  i-1, 7 

AtY(t»i)]-A3  i -8, 9,11 

AtY(t»3)]-Au 

AtY(to1)]-A^  1*1,6,20,21 

A[Y(b19)]*A6 

A[Y(b5)]-A? 

AlY(b2)]-Ag 

A[Y(b  )]-(cluster  of  82  articles) 

Definitions  of  Clusters; 

V(Vu*Vi7)UD 

V(blb7VlU)UE 

A3‘(b3b8b9bllbi9)U(W5)Ur 
AL'(b3b8b9)  U<  Wl,)U0 

A5'(V6b8bl6b20b21)U(d2gl«U) 

A6'^bl6bl7b19b20b2 1  ^  H 
A?-(b5f5)UK 


A[Y(b12)]«A9 

AlY(bi5n‘Aio 
A[Y(b1)]-(bi)  1-13,18 

*hfysYl*Vl9)1,*ll 

,!l^MWuVA 
b19b20b21^'  *12 
A[Y(d1)]-A1  i-1, ...,6 

AtY(ei)]-A2  i-1, 3,..., 6 

A[Y(e2)]-Ag 


A8‘(b2b19d5ele2V2)UJ 

A9*^b12blliele2ml^ 

Aio-(bi5f5J2) 

AllcA5^^bl7bi9fi^ 

Ai2-AiUA2UA3UAijL)A5U(VlJ2) 

A13“A12^J^b2^3^^ 


Fig. 9.39. Answers to selected requests for Langmuir Probe
Bibliography  (Physicist  2). 


ll,lw 


d2...d6 


Fig. 9.40. Relationship of Clusters for Langmuir Probe
Bibliography  (Physicist  2). 


COMPARISON 

The six preceding search strategies produced a total of about 500
different articles. It was decided that this constituted too large a
file to ask the user to evaluate. The file was, therefore, reduced to
the 104 articles which appeared to have the greatest chance of being of
interest to the user. These included the 83 articles which were retrieved
by two or more of the six search strategies, the 15 additional articles
which were bibliographically coupled to the set B with a value of three
or more, and another six articles which contained the word "probe" in
their titles in the sense of a measuring device. In seven other
articles the word "probe" was found in the title but it was used as
a synonym for investigation (e.g. "three-field model as a probe of
higher group symmetries").
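The first reduction step, keeping articles retrieved by two or more strategies, can be sketched as follows. The strategy names and article identifiers are hypothetical; only the final arithmetic (83 + 15 + 6) comes from the text.

```python
from collections import Counter


def retrieved_by_at_least(results, k=2):
    """Articles that appear in the result sets of at least k strategies."""
    counts = Counter()
    for articles in results.values():
        counts.update(set(articles))   # one vote per strategy
    return {article for article, n in counts.items() if n >= k}


# Hypothetical result sets for three strategies.
results = {
    "title word": ["a1", "a2"],
    "author": ["a2", "a3"],
    "citation": ["a3", "a2", "a4"],
}
multi = retrieved_by_at_least(results)

# Size of the evaluation file as described in the text.
evaluation_file_size = 83 + 15 + 6
```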

The 104 articles presented for evaluation are listed in Fig. 9.41.
The first column (A) is the identification. The next column (B)
contains an indication (1) of those articles which are members of set B.

The next six columns (C-H) note which articles were retrieved by each
of the six search strategies:

C - Column contains a one if the paper has the word "probe" in
its title.

D - Number of authors of the paper that are also authors of the 112
papers in the Bibliography.

E - Number of the 112 papers in the Bibliography that are cited by
the paper.

F - Bibliographic coupling strength of the paper to the set B.

G - Number of papers which cite the paper and also cite one or
more of the 112 papers in the Bibliography.

H - Symbol of the paper in the clusters of Figs. 9.38 to 9.40.

(Note  that  the  counts  in  Columns  D  and  F  do  not  include  the  authors 
or  citations  which  match  only  because  the  article  itself  is  in  the 
set  B. ) 

The last column (J) contains the evaluation code. Each document was
assigned to one of the following five categories:

1 - Of personal interest to user.

2 - Of general interest.

3 - Perhaps of general interest.
(e.g. a probe may have been used as a tool in the experiment.)

4 - Degree of interest cannot be determined by examination of the
author(s).

5 - Not of interest.

Fig. 9.41. Langmuir Probe papers evaluated by physicist.
(Explanations of columns are given in text.)

In Fig. 9.42 the results of each of the six search strategies are
tabulated  for  comparison.  The  results  for  bibliographic  coupling  are 
separated  into  two  entries  depending  on  the  coupling  strength. 

An examination of Fig. 9.42 indicates that the search strategies
using the author, citation, and cited-by-same criteria yield
comparatively large sets of documents containing relatively few of the
articles judged to be of specific pertinence by the user (evaluation
category 1).

Bibliographic coupling with the coupling strength greater than or
equal to one yields such a large set of articles (270) that it would be
more appropriate to compare it with a larger cluster such as the
85-article cluster which contained 26 of the category-1 documents. Let
us therefore compare cluster A^ with the set of articles with coupling
strength greater than or equal to two. It will be seen that A^ is less
than half as large and yet contains three more of the category-1
documents.

It  will  be  observed  that  the  clustering  procedure  uses  the  same 
data  used  in  bibliographic  coupling  but  in  a  different  way.  Consider, 
for example, the 27 articles in A^ which are not part of the original
bibliography. Seven have a coupling strength to B of only 1 and six
have a coupling strength of 2, whereas an article like 1-129-1181
with a coupling strength of 7 is not included in A^.
                                   Number of    Number of articles in each
                                   articles     evaluation category
Search Strategy                    retrieved      1     2     3     4     5

Title word                            58         30    11     1     2     6
Author                               120         18    10    15     2     8
Citation                              78         16     7     8     0     5
Bibliographic coupling
  (strength >= 2)                     88         19    10    19     0     9
Bibliographic coupling
  (strength >= 1)                    270         26    12    29     2    15
Cited-by-same articles               101         13     8     4     0     7
Clustering (A^)                       43         22     8     7     0     6

Total                              abt. 500      31    16    32     4    21

Fig. 9.42. Comparison of results of seven search strategies.
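The comparison can be made more concrete by computing two ratios for each strategy from the figures in Fig. 9.42. The sketch below uses the counts as read from that figure; the ratio names are not the author's, and the second ratio is only a rough indicator because just 104 of the roughly 500 retrieved articles were evaluated by the user.

```python
# (articles retrieved, category-1 articles retrieved), as read from Fig. 9.42.
strategies = {
    "Title word": (58, 30),
    "Author": (120, 18),
    "Citation": (78, 16),
    "Bibliographic coupling (strength >= 2)": (88, 19),
    "Bibliographic coupling (strength >= 1)": (270, 26),
    "Cited-by-same articles": (101, 13),
    "Clustering": (43, 22),
}
TOTAL_CATEGORY_1 = 31  # category-1 articles among all those evaluated

ratios = {
    name: (cat1 / TOTAL_CATEGORY_1,   # share of category-1 articles found
           cat1 / retrieved)          # share of retrieved set in category 1
    for name, (retrieved, cat1) in strategies.items()
}

for name, (found, share) in ratios.items():
    print(f"{name:40s} {found:5.2f} {share:5.2f}")
```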


Let us now turn our attention to the title word search. Fig. 9.42
indicates that this search strategy retrieved four more of the
category-1 documents than were retrieved by the search strategies based
on citations (i.e. bibliographic coupling and the 85-document cluster).
This result provides an example of a case where title words provide a
better basis for retrieval than do citations. Previous experience
would indicate that such is not generally the case.

To determine why the clustering procedure was less effective in
this case, the five category-1 documents which did not appear in any of
the clusters generated were examined. It was found that three of them
(bij, b^, and 21-29-1165) contain only a single citation and the other
two (b^g and 21-29-1313) contain only two citations. We are thus led
to the same conclusion arrived at earlier that the clustering system,
in general, has trouble properly placing documents with three or fewer
citations.

The remedy for this difficulty would be to use some additional
types of partitioning data. In the example at hand, all 31 of the
category-1 documents could be retrieved in the same cluster if the
system used not only the partitions generated by citations but also
those generated by certain keywords like "probe".

One other observation may be worth noting. The article b^ was
part of the original bibliography but was not included in any clusters
with other members of the bibliography. A check of its bibliography
showed that it had nine citations, which experience indicated should be
enough to place it in the correct cluster. The author of this thesis
decided, therefore, to ask the physicist if b^ was in a different area
from the other 20 members of the bibliography. Before this was asked,
however, the evaluation of the 104 articles of Fig. 9.41 was made. A
check of this evaluation revealed that 19 of the 21 members of the
original bibliography were placed in evaluation category 1 while b^
was placed in category 3.

9.6  Summary  of  Results 

For purposes of comparison and emphasis let us summarize some of
the significant features of the last three sections. In Fig. 9.43 two
measures of the success of the clustering procedure are tabulated.
Column four indicates how many of the pertinent articles were retrieved
by the clustering system in each test. Column five indicates what
fraction of the articles retrieved were pertinent. The particular
cluster selected for each test is specified in parentheses in column
three.

                                  Number of                              Percent of      Percent of
                                  papers         Size of                 pertinent       cluster
                                  specified      related                 papers in       specified
Name of test                      as pertinent   cluster                 cluster         as pertinent

Bibliography 1 (Sec. 9.31)        10             17 (Ag)                 9/10 = 90%      9/17 = 53%
Bibliography 2 (Sec. 9.32)        16             64 (Au)                 14/16 = 88%     14/64 = 22%
Bibliography 3(III) (Sec. 9.33)   27             48 (A12)                20/27 = 74%     20/48 = 42%
Bibliography 3(IV) (Sec. 9.33)    9              31 (Ag)                 8/9 = 89%       8/31 = 26%
Bibliography 3 (me) (Sec. 9.33)   13             22 (A$)                 10/13 = 77%     10/22 = 46%
Category 1 (Sec. 9.41)            43             105 (A9 ∪ A2^ ∪ A26)    28/43 = 65%     28/105 = 27%
Category 2 (Sec. 9.42)            30             133 (A1 ∪ Au)           19/30 = 64%     19/133 = 14%
User 1 (Sec. 9.51)                12(y)          59 (A10)                9/12 = 75%      9/59 = 15%
User 2 (Sec. 9.52)                31(1)          43 (A13)                22/31 = 71%     22/43 = 51%

Fig. 9.43. Summary of the experimental results of
Sections 9.3-5.
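
The two measures tabulated in columns four and five can be recomputed
from the raw counts. The following is a minimal sketch in Python, not
part of the experimental system; the function name cluster_measures and
the dictionary layout are assumptions made for illustration, while the
counts are taken from three rows of Fig. 9.43.

```python
def cluster_measures(hits, pertinent, cluster_size):
    """Return (fraction of pertinent papers retrieved,
               fraction of the cluster that is pertinent)."""
    return hits / pertinent, hits / cluster_size


# (pertinent papers found in cluster, papers specified as pertinent, cluster size)
rows = {
    "Bibliography 1": (9, 10, 17),
    "Category 1": (28, 43, 105),
    "User 2": (22, 31, 43),
}

for name, (hits, pertinent, size) in rows.items():
    retrieved, pure = cluster_measures(hits, pertinent, size)
    print(f"{name}: {retrieved:.0%} of pertinent retrieved, "
          f"{pure:.0%} of cluster pertinent")
```

The first fraction corresponds to column four and the second to column
five; the same pertinent count appears in both numerators.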


One additional statistic may be of interest. This relates to
whether the documents that are pertinent to a search are added to the
cluster early or late in the process. For this purpose 50 clusters
from Sec. 9.33 and 9.41 were analyzed and the number of articles of
specified pertinence added in each quarter of the process was noted.
These figures were averaged for the 50 clusters. The results are
shown in Fig. 9.44. It will be seen that on the average almost half
(45%) of the pertinent articles which are included in the final
cluster are added during the first quarter of the process.
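
The tally for a single cluster might be sketched as follows. This is an
illustration only: the function name quartile_shares and its arguments
are hypothetical, and it assumes that the order in which the procedure
added documents to the cluster has been recorded.

```python
def quartile_shares(order_added, pertinent):
    """Percent of the retrieved pertinent documents added in each
    quarter of the process.

    order_added: documents in the order they were added to the cluster.
    pertinent:   set of documents of specified pertinence.
    """
    n = len(order_added)
    counts = [0, 0, 0, 0]
    for position, doc in enumerate(order_added):
        if doc in pertinent:
            # position runs from 0 to n-1, so the quarter index is 0..3
            counts[4 * position // n] += 1
    total = sum(counts)
    return [100.0 * c / total if total else 0.0 for c in counts]


order = ["a", "b", "c", "d", "e", "f", "g", "h"]
shares = quartile_shares(order, {"a", "b", "e"})
```

Averaging the four figures over all of the clusters examined would give
one value per quartile of the kind plotted in the figure.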


Fig. 9.44. Graph showing average percent of bibliography
(or category) articles added during each
quartile of the clustering process.

CHAPTER X
CONCLUSIONS

In this chapter we shall make some initial comments concerning the
adequacy of the various components of the experimental system. Then
certain conclusions about the clustering procedure will be given. Next
the effectiveness of the overall model and system in retrieving useful
sets of documents will be evaluated. In the final section some possible
avenues for further research will be suggested.

10.11 MAC Time-Sharing System

After five years' experience with batch processing computers, the
author of this thesis found the MAC time-sharing system a refreshing
change with some significant advantages. Let us briefly comment on the
use of the MAC system in three areas: in debugging programs, in
testing and evaluating systems, and in operational retrieval functions.

DEBUGGING

It is estimated that the use of the MAC system cut by a factor of
somewhere between two and ten the amount of time required to debug the
experimental program. This, of course, is due to the fact that
turn-around time for a run with time-sharing is of the order of a few
minutes, whereas with batch processing it is usually several hours or
days.

The availability of more sophisticated debugging routines would
have reduced debugging time even further. Some features that would
have been of special help are multiple break points, conditional break
points, an interpretive mode, more convenient patching, automatic
updating of the English text, etc.

One problem in using time-sharing for debugging is that it is
almost too easy to make changes to a program and re-run it. This
results in one making a change before its consequences have been fully
considered. Part of the answer to this problem lies in self-discipline
on the part of the programmer. It will also help when a computer
becomes available on a 24-hour basis so one is not tempted to try to rush
through a change before a maintenance or test session.

Two minor improvements to the consoles would help. A less noisy
console would allow the user to more effectively contemplate a problem
at the same time the computer is printing out some results on the
console. Also a neon light showing when the console is being serviced by
the central processor would be of considerable value.

SYSTEM  TESTING 

After one has obtained a program that is debugged and performs
according to specification, it often becomes apparent that the original
specifications for the program need changing. This may result in some
modifications to the program, or if the change is extensive, it may
require rewriting the whole program. The same advantages and problems
that time-sharing has in debugging are also in evidence in this cycle
of program specification and respecification.

OPERATIONAL  RETRIEVAL 

Let us now consider what would happen if one were to decide to use
the MAC system or one like it as an operational information retrieval
system serving a community of real users.

If all of the IBM 1302 disc were used for data, a file 30 times the
size of the current T.I.P. file could be stored. This would allow one
to increase the time span covered by the periodical literature from 3
to perhaps 10-15 years and also add some non-periodical literature.

All  of  the  files  could  also  be  completely  inverted.  There  would 
probably  still  be  room  left  for  coverage  of  another  discipline  about 
the  size  of  physics.  If  magnetic  tapes  were  used,  coverage  could  be 
increased  even  further  by  loading  the  disc  with  different  data  on 
different  days  of  the  week. 

Let us assume that the current limit of 30 users on line at once
is maintained. The response time for simple requests for information
would probably be acceptable to most users. This would be 1 second of
computer time and 1-30 seconds of real time. The response time to
more complex requests would probably be found objectionable by some
users. Retrieval of a cluster, for example, might take 40-50 seconds
of computer time and 5-10 minutes of real time.

The response time to complex requests could be improved by a
factor of 5-10 if the supervisory system were modified to allow some
type of direct access to the disc. The current supervisory program is
designed for the storage of files that are constantly changing. This
places a penalty factor of 5-10 on the accessing of files that never
change, such as those found in a library.

One of the biggest difficulties with using the MAC system as an
information retrieval service is that it has no provision for the
transmission, display and reproduction of analog information. Such a
capability would probably be needed, for example, if the system were to
supply the abstracts or total text of articles.

Thus, with the current system a person with a console in his
office might be able to identify which articles are of interest, but
he would still have to go to the library to get them. (He could
perhaps have his own microfilm system, but this would be very expensive.)

10.12  T.I.P.  Document  Collection 

The first tests of the clustering procedure were performed using
a single volume of the Physical Review. As the data base was increased,
some marked changes in the characteristics of the procedure were noted.
One of the major causes of these changes was the fact that the
partitioning sets for the single volume are all quite small, whereas the
partitions for the total T.I.P. file have a wide range of sizes.

The  question  arises  as  to  whether  an  increase  of  perhaps  one  or 
two  orders  of  magnitude  in  the  current  document  file  might  further 
change  the  way  the  procedure  operates.  In  an  attempt  to  answer  this 
question,  let  us  first  note  that  such  an  increase  would  necessarily 
involve  coverage  of  some  additional  branches  of  science  such  as 
chemistry,  mathematics  and/or  electrical  engineering.  This  would  be 
true  since  a  sizeable  fraction  of  the  significant  physics  periodical 
literature  that  is  being  published  is  already  being  added  to  the  T.I.P. 
file.  This  implies  that  the  size  of  the  clusters  generated  by  the 
procedure  would  not  significantly  change  even  if  the  size  of  the 
collection  were  greatly  increased. 

Also the use of an inverted data storage system would keep the
access time to any one piece of information relatively constant even
if the size of the file were measurably increased. It is, therefore,
concluded that the system would operate in essentially the same way it
currently does even if the document file were scaled up in size by
several orders of magnitude.

10.13  Partitions 

The experimental results as summarized in Fig. 9.43 are evidence
of the fact that partitions based on citation information constitute a
useful data base for the measure of relatedness and the clustering
procedure. There were, of course, a few documents which were not
included in the cluster to which it appeared they should belong. In
almost all of these cases it was found that the documents had three or
fewer citations, which was evidently an insufficient number to properly
place them in their appropriate cluster.

From this, one might conclude that the clustering system as
presently programmed may not be an effective retrieval tool for a file
in which a large fraction of the documents have three or fewer
citations. Actually what may be needed in such a file is a modification in
the type or types of partitioning information utilized so that
partitions are also generated by users, title words, authors or some other
parameter(s). A case where other types of partitionings would have
helped even in the citation-rich T.I.P. file was described in Sec. 9.52.

10.14 Storage Structure

One general conclusion that was reached in this project is that in
a dynamic system an attempt should be made to give the data a general
structure instead of a structure tailored to one specific requirement.
This will allow a flexible approach to new uses of the data. An
inverted file structure coupled with the raw data file was suggested as a
possible general filing system.

It is argued in Sec. 7.22 that an inverted file should occupy
about the same amount of storage as is occupied by the file which is
being inverted. This claim was verified for the data in the T.I.P.
file.
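
One way to see why the two files should be of comparable size is that
every (document, cited item) pair appears exactly once in each. The
following sketch illustrates this on a toy file. It is not the T.I.P.
file format; the names raw_file and invert and the document labels are
invented for the example, and actual storage would also depend on how
the entries are coded.

```python
from collections import defaultdict


def invert(raw_file):
    """Build the inverted file: cited item -> list of documents citing it."""
    inverted = defaultdict(list)
    for doc, citations in raw_file.items():
        for cited in citations:
            inverted[cited].append(doc)
    return dict(inverted)


# Toy raw data file: document -> items it cites
raw_file = {
    "d1": ["c1", "c2"],
    "d2": ["c2", "c3"],
    "d3": ["c2"],
}
inverted_file = invert(raw_file)

# Each (document, citation) pair is stored once in each file.
raw_entries = sum(len(v) for v in raw_file.values())
inverted_entries = sum(len(v) for v in inverted_file.values())
```

A look-up such as inverted_file["c2"] then reaches all documents citing
a given item directly, without a scan of the raw file.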


10.15 Retrieval Language

The fact that both the syntax and vocabulary of the retrieval
language are table-driven (i.e. they are specified by tables) was
considered to be a significant advantage. As modifications in the
structure of the request and in the words used to describe the request
suggested themselves, they were easily incorporated into the system by
a minor modification in the appropriate table.
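
The idea of a table-driven language can be illustrated with a small
sketch. The two tables below are hypothetical and far simpler than those
of the experimental system; they are meant only to show that a new word
or a new form of request is admitted by adding a table entry rather than
by changing the program.

```python
# Hypothetical vocabulary table: word -> grammatical role
VOCABULARY = {
    "get": "verb",
    "print": "verb",
    "titles": "noun",
    "articles": "noun",
    "related": "adjective",
}

# Hypothetical syntax table: sequences of roles accepted as a request
SYNTAX = {
    ("verb", "noun"),
    ("verb", "adjective", "noun"),
}


def parse(request):
    """Return the (word, role) pairs of a request, or raise ValueError."""
    words = request.lower().split()
    try:
        roles = tuple(VOCABULARY[w] for w in words)
    except KeyError as err:
        raise ValueError(f"unknown word: {err.args[0]}") from None
    if roles not in SYNTAX:
        raise ValueError(f"unaccepted form: {' '.join(roles)}")
    return list(zip(words, roles))
```

Adding the entry VOCABULARY["fetch"] = "verb" would make the request
"fetch titles" acceptable with no change to parse itself.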

Currently no one besides the author of this thesis has had
sufficient experience with the retrieval language to evaluate it. Let
me, therefore, make some admittedly biased observations.

First, the language was found to be easy to remember even after a
lapse of several months in which it was not used. The language was also
found to have considerable room for future growth. Indeed a large
number of additional verbs and adjectives that would be useful in
retrieval suggested themselves. The ability to make a request for
information as complex or as simple as needed was also found helpful.
Actually only a maximum of about three or four levels of structure has
been utilized so far.

10.2 Evaluation of Procedure

In this section we shall discuss whether the procedure as described
in Chapter V has the general characteristics which it needs for
operation as a retrieval tool. An evaluation of the actual utility of the
current procedure and experimental system in satisfying user requests
will be discussed in the next section.

CONVERGENCE 

Considerable difficulty was encountered with the earlier
clustering procedures because they occasionally entered into a non-terminating
cycle. The steps taken to prevent such cycles have been described in
Sec. 5.53. The experience gained over the past several months supports
the contention that the current procedure will always converge in a
finite number of iterations to an answer cluster or to a comment that
the request is inconsistent.

GENERAL-SPECIFIC 

From Fig. 9.3 one can conclude that the use of a bias in the
correlation network does, indeed, allow one to increase or decrease the
size of the answer cluster. That the value to be given the bias can be
automatically determined by the composition of the request has been
experimentally verified by the results of Secs. 9.3-5.

AMBIGUITY  RESOLUTION 

In Chapter IX examples are given showing how some of the possible
answer clusters that satisfy a given request can be eliminated by
specifying additional documents to be of interest or not of interest
(additions to the Y and Z sets). It is clear that one can arrive at a
point at which only one cluster satisfies the request by the appropriate
additions to the Y and Z sets. From Fig. 9.? one might conclude that
on the average at least two members of Z are required to make a request
unambiguous. Of course, even if the request is ambiguous, the desired
answer cluster may still be found. For example, in Sec. 9.31 seven
out of the ten requests with Y = (b^) resulted in A^ and yet all seven
are ambiguous.

INCONSISTENCY RECOGNITION

From the results of Fig. 9.5 we conclude that not only does the
procedure mark as inconsistent those requests for which there is no
answer cluster, but it also decides that some requests for which a
valid answer cluster exists are inconsistent. This difficulty
is not considered serious, however, since the user can be coupled into
the system and can guide the procedure in the right direction and
reshape the request if an inconsistent situation is reached.

10.3  Evaluation  of  System 

In  the  last  section  several  conclusions  were  stated  concerning  the 
characteristics  of  the  clustering  procedure.  In  this  section  we  will 
discuss  the  more  general  problem  of  the  effectiveness  of  the  overall 
system  as  a  retrieval  tool. 

From Fig. 9.43 we note that the percent of pertinent documents
retrieved by clustering ranges from 64 to 90%. This compares
favorably with a published retrieval efficiency of about [illegible] for other
automatic retrieval systems.

Almost all of the pertinent documents which were not retrieved
were found to have three or fewer citations. This would give one the
hope that with an expanded data base for the partitions the 64-90%
retrieval efficiency could be improved even more.


We next note from Fig. 9.43 that from 47% to 86% of the retrieved
documents are not part of the set of documents of known pertinence.
Let us assume for a moment that all of these documents are irrelevant.
Many users would still find this acceptable since a quick examination
of the titles could be used to select the articles of interest from
the larger set.

Now let us consider whether or not some of the additional articles
might really be found to be of interest by a user who has selected the
cluster in which they are found.

First, we observe that for the tests of Sec. 9.3 some of the
articles in the clusters were published after the October IEEE
Proceedings came out and thus had no chance of being part of the bibliographies
even if they were pertinent. This is the case, for example, with the
following documents of Fig. 9.21: d^, e^, k^, k^, k^, k^, m^, ...,

®lgj  b27*  £3*  ^3»  1^*  aD<*  ^5* 

Also the authors of the three bibliographies used probably did not
intend to exhaustively cover the area. They may have only selected
what they considered to be the best reference(s) available for each
specific concept or topic.

These arguments do not hold for the articles added by the
clustering procedure to the categories of Sec. 9.4. The categories are
supposedly exhaustive and should include all but the most recent
articles. In defense of the additional articles in the clusters let
us give two examples. The first title below is included in the
Physical Review category on "Luminescence" while the second is not.

1-133-1163

Optical  properties  of  cubic  SiC,  luminescence  of  nitrogen- 
exciton  complexes,  and  interband  absorption. 

1-133-2023 

Optical  properties  of  15R  SiC,  luminescence  of  nitrogen- 
exciton  complexes,  and  interband  absorption. 

As a second example, consider cluster A^ of Sec. 9.42. This
cluster contains three articles that are classified in the category,
"Erbium", in Physics Abstracts. Of the 31 other articles in the
cluster three contain the word, "erbium", in their title and seven
more contain the word, "erbium", in the abstract or text. All of the
remaining articles have at least one of the other 14 rare earth elements
mentioned in the title. The following is an example of an article
contained in the cluster A^ but not included in the erbium category.

1-126-726

Energy levels and crystal-field calculations of Er3+ in
yttrium aluminum garnet.

For the tests with users described in Sec. 9.5 the percentage of
the cluster that is pertinent would be 27/59 = 46% for User 1 and
37/43 = 86% for User 2 if all of the articles of questionable (or
general) pertinence were counted. The user might even find some of
those articles judged non-pertinent to be of interest if he were
allowed to examine the actual article instead of just the title.

The  foregoing  arguments  and  data  suggest  that  a  user  might,  on  the 
average,  find  at  least  half  of  the  documents  in  a  cluster  of  interest. 

It is perhaps significant that the percentage of pertinent
documents retrieved is lower in the tests for the two categories than for
the other tests. The other tests involved bibliographies compiled by
experts (authors and users) while the categories were generated by
indexers.

One might also note that the tests of Sec. 9.3 have higher
percentages of pertinent documents retrieved on the whole than do the
tests of Sec. 9.5. This could be explained by the fact that the users
of Sec. 9.5 based their decisions on the titles, authors, and citations
of the articles, while the authors of Sec. 9.3 had undoubtedly read the
articles they cited. The conclusion to be reached here is that the
clustering procedure tends to do best in those tests where it was
compared to sets generated by the careful consideration of experts.


In conclusion, the experience of this thesis indicates that
clustering may be a useful tool to research workers who desire
information covering either a very specific or a very broad area of interest.
It is our opinion that further development and research are both
warranted and essential.

10.4 Suggestions for Further Research

The suggestions to be presented here have been divided into
three general categories:

(1) Data base and data structure

(2) Clustering procedure and interaction language

(3) Theoretical problems

10.41 Data Base and Structure

OTHER  DATA  BASES 

It has already been suggested (Sec. 10.13) that the clustering
system should be tested on other types of partition data. Some of the
other types of partitions that might be tried are listed in Sec. 6.22.

It is also suggested that tests be made of the simultaneous use of
several types of partitioning data. In this connection one might
consider the use of a weighting factor for the partitions which might,
for example, give a larger weight to partitions generated by citations
than to those generated by title words.
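
A weighting factor of this kind might be sketched as follows. The
weights and the weighted co-occurrence count below are assumptions made
for illustration only. The count is a simple stand-in and is not the
information-theoretic measure of relatedness of Chapter III.

```python
# Hypothetical weights: citations count more than title words
WEIGHTS = {"citation": 1.0, "title_word": 0.25}


def weighted_cooccurrence(doc_a, doc_b, partitions):
    """Sum the weights of the partitions whose subset of interest
    contains both documents.

    partitions: list of (partition_type, set_of_documents) pairs.
    """
    return sum(
        WEIGHTS[kind]
        for kind, members in partitions
        if doc_a in members and doc_b in members
    )


partitions = [
    ("citation", {"d1", "d2", "d3"}),
    ("citation", {"d1", "d2"}),
    ("title_word", {"d1", "d4"}),
    ("title_word", {"d1", "d2", "d4"}),
]
```

Here d1 and d2 share two citation partitions and one title word
partition, while d1 and d4 share only title word partitions and so
receive a much smaller total.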

Of particular interest would be a system which utilized the type
of usage data described in Chapters II and III.

CHANGING FILE

There are a number of questions relating to the fact that a document
collection is continually changing. What should happen when documents
are added to or deleted from the file? Can the user be automatically
notified of new documents of interest? In this connection one might
want the user to permanently store those clusters found to be of
interest. Then as new documents come into the file they can be
compared against the clusters. The user would then be notified of those
articles which were valid members of his clusters.

CODING

There is also need for additional work on the problem of data
coding and compression. For example, one might be able to reduce
storage requirements considerably by storing codes for all (or certain)
authors' names in the raw data file. This may be true of the other
types of data also.
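
A simple form of such coding is sketched below: each distinct author
name is stored once in a code table and replaced in the file by a small
integer. The function name and the sample file are hypothetical, and the
saving actually obtained would depend on how often names repeat.

```python
def encode_authors(raw_file):
    """Replace each author name with a small integer code.

    Returns (code_table, coded_file); code_table[i] is the name
    that was assigned code i.
    """
    code_of = {}
    coded_file = {}
    for doc, authors in raw_file.items():
        # setdefault assigns the next free code on first sight of a name
        coded_file[doc] = [
            code_of.setdefault(name, len(code_of)) for name in authors
        ]
    code_table = sorted(code_of, key=code_of.get)
    return code_table, coded_file


sample = {
    "d1": ["P. A. Jones", "R. Smith"],
    "d2": ["P. A. Jones"],
}
code_table, coded = encode_authors(sample)
```

The original names are recovered by indexing code_table with the stored
codes.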

10.42 Procedure and Language

There are a number of directions in which the clustering procedure
and interaction language might be extended. One objective might be to
make a wider class of statements acceptable and understandable to the
system. This might involve increasing the vocabulary and/or allowing
other syntactic forms.

PARSING  BY  CONTEXT 

As a specific suggestion we note that the current system determines
the function of (parses) a word by a simple table look-up. A word
cannot have a dual function depending on its context. Thus if one wants
to use "p" as an abbreviation for print ("p. the titles of set 1"), this
would currently exclude its use, say, as an abbreviation for paper or as
the initial in an author's name ("get articles by 'P. A. Jones'" would
however be acceptable). It should be possible, however, to distinguish
between these different uses, if one utilizes the context.
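
A sketch of how context might be used is given below. The rules are
invented for illustration and are not a proposal for the actual grammar:
the role of the token "p." is guessed from the words on either side of
it.

```python
def role_of_p(words, index):
    """Guess the role of the token "p." from its neighbours
    (illustrative rules only)."""
    previous = words[index - 1] if index > 0 else None
    following = words[index + 1] if index + 1 < len(words) else None
    if previous is None:
        return "verb:print"        # first word of a request
    if previous == "by":
        return "author-initial"    # e.g. "by p. a. jones"
    if following is not None and following.isdigit():
        return "page"              # e.g. "v. 11 p. 6"
    return "noun:paper"
```

With rules of this kind the same token could serve as a verb in
"p. the titles of set 1" and as an initial in
"get articles by p. a. jones".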

GRAPHIC  DISPLAY 

A more radical extension of the language would be through the use
of some type of graphical device. For example, it might prove useful to
display part of the document network on an oscilloscope and to allow the
user to specify the interesting and non-interesting documents by means
of a light pen.

In addition to increasing the flexibility of the language, one
might also want to allow the specification of some other functions. Let
us suggest some additional functions that the clustering procedure
might appropriately perform.

CLUSTER  SIZE 

A user might want to limit the size of the answer cluster to some
specified range at the outset (e.g. "Get between 3 and 7 articles
related to Phys. Rev. v. 13b p. 1899."). This could be accomplished by
increasing or decreasing the bias enough so that the size of the answer
cluster fell within the specified range.
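
One way the bias might be adjusted automatically is by repeated halving
of an interval, as sketched below. The sketch assumes that the cluster
never shrinks as the bias grows. The toy procedure toy_cluster and its
relatedness values are invented stand-ins for the actual clustering
procedure.

```python
def find_bias(cluster_for, low, high, min_size, max_size, steps=50):
    """Bisect on the bias until the cluster size lies in
    [min_size, max_size].

    cluster_for(bias) is assumed to return a cluster whose size never
    shrinks as the bias grows. Returns (bias, cluster), or None if no
    suitable bias is found within the given number of steps.
    """
    for _ in range(steps):
        bias = (low + high) / 2
        cluster = cluster_for(bias)
        if len(cluster) < min_size:
            low = bias      # too specific: raise the bias
        elif len(cluster) > max_size:
            high = bias     # too general: lower the bias
        else:
            return bias, cluster
    return None


# Toy stand-in: keep documents whose relatedness to the request,
# plus the bias, is positive.
relatedness = {"d1": 0.9, "d2": 0.6, "d3": 0.4, "d4": 0.1,
               "d5": -0.2, "d6": -0.5, "d7": -0.7, "d8": -0.9}


def toy_cluster(bias):
    return {d for d, r in relatedness.items() if r + bias > 0}


result = find_bias(toy_cluster, -1.0, 1.0, 3, 7)
```

If no bias yields a cluster in the requested range (for instance when
several documents enter the cluster at the same bias), the sketch
returns None and the user would have to widen the range.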

DATA  BASE 

It would also be of value to a user if he could specify the type of
partitioning data to be used by the clustering procedure. Thus the
command, "Get the articles related by authors and users to Phys. Rev.
Letters v. 11 p. 6", would use the partitions generated by both authors
and usage data to create the answer cluster. This control could be
extended to select for the data base certain classes of partitions
within a broad type. For example, a request of the type, "Get the
articles related by M.I.T. faculty users to Phys. Letters v. 7 p. Hi",
would allow the user to single out for use that type of partitioning
which he thought would yield the best results.

CLUSTERS OF AUTHORS, ETC.

There is no real reason why clusters must be limited to sets of
documents. It may be useful to generalize the system to allow clusters
to be formed of other types of entities such as authors, locations,
words, etc. It might be very helpful, for example, to be able to
determine the cluster of scientists that are working in a given field or area.

10.43 Theoretical Problems

ANSWER CLUSTER DEFINITION

Some modification to the definition of an answer cluster may be of
value. For example, should a change be made to the requirement that all
the documents specified as interesting be in the cluster?

NOISE 

There will, of course, be cases where certain documents are
mistakenly included together in a set of interest. This may arise, for
example, from an incorrect judgement on the part of a user or perhaps
from a clerical slip. The effect of this type of noise on the system
should be investigated. Also suitable steps should be taken to maintain
the integrity of the data base through editing processes.

SELF-SUSTAINING  RUTS 

Consider an information retrieval system which is based on the
data generated by its users. This might be one based on usage data or
on citations. Is it possible in such a system for a self-reinforcing
feedback loop to be created which cannot be altered? For example, if
users are supplied documents on the basis of past use, this may create
new partitions which only serve to reinforce the results of the old
partitions.

EVALUATION MEASURE

The measure described in Chapter III was not suggested for use in
rating the merit or value of documents. Its function was to group
together documents that were mutually pertinent. If a suitable way
could be devised for measuring the worth of documents, this would be of
considerable aid to users. Perhaps this would take the form of some
type of consensus of opinion of the previous users of the documents.

TRAILS VS. SETS

In the article already cited by V. Bush the model suggested for
information retrieval was a trail leading from one pertinent document
to the next. The model used in this research endeavor is the
partitioning of the file into two subsets. Actually both models have useful
features. In some cases there is a definite pattern or trail which
should be followed in consulting the documents related to a given
subject. In other cases the order in which the documents should be
examined is apparent from their publication dates. In still other cases
there is no particular order in which the documents need be consulted.
Thus it would seem that one might want to include both the ideas of
sets of documents and trails of documents in a more general information
retrieval model.

PREDICTIVE  USAGE 

As additional information becomes available on the types of
questions that are asked by users and the sets of documents that seem
to satisfy them, it may be possible to design a system involving some
form of prediction of what a user really wants when he asks a given
question. This might even be extended to involve trends in document
usage, so that future document use is extrapolated on the basis of

MEASURES OF RELATEDNESS

Some  of  the  measures  which  have  been  proposed  for  use  in  informa¬ 
tion  retrieval  are  tabulated  below.  Measures  (l)  to  (6)  were  originally 
suggested  in  terms  of  frequency  counts.  Measures  (7)  and  (8)  were  first 
proposed  in  terms  of  probabilities.  For  purposes  of  comparison  we  have 
attempted  to  express  each  measure  in  the  table  both  in  terms  of 
probabilities  and  frequency  counts.  In  the  case  of  measure  (5)  this 
was  not  possible. 
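
As  a  minimal  sketch  of  what  it  means  to  express  one  measure  in  both  forms,  the  following  Python  fragment  estimates  probabilities  as  relative  frequencies  and  evaluates  a  simple  logarithmic  association  quantity  either  way.  The  names  n_total,  n_i,  n_j  and  n_ij  (total  number  of  partitions,  partitions  containing  document  i,  document  j,  and  both)  are  assumptions  made  for  this  illustration,  as  is  the  relative-frequency  estimate  itself.  The  quantity  shown  is  a  generic  information-theoretic  example  and  is  not  claimed  to  be  any  particular  measure  in  the  table. 

```python
import math


def probabilities_from_counts(n_total, n_i, n_j, n_ij):
    """Estimate P(i), P(j) and P(i,j) as relative frequencies (an assumption)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_i / n_total, n_j / n_total, n_ij / n_total


def association_from_probabilities(p_i, p_j, p_ij):
    """Illustrative quantity log2( P(i,j) / (P(i) * P(j)) )."""
    if p_i <= 0 or p_j <= 0 or p_ij <= 0:
        return float("-inf")  # no co-occurrence evidence
    return math.log2(p_ij / (p_i * p_j))


def association_from_counts(n_total, n_i, n_j, n_ij):
    """The same quantity in frequency form: log2( N * N_ij / (N_i * N_j) )."""
    if n_i <= 0 or n_j <= 0 or n_ij <= 0:
        return float("-inf")
    return math.log2(n_total * n_ij / (n_i * n_j))


# Hypothetical counts, chosen only to exercise the two forms.
counts = (100, 10, 20, 8)
via_probabilities = association_from_probabilities(*probabilities_from_counts(*counts))
via_counts = association_from_counts(*counts)
```

Under  the  relative-frequency  assumption  the  two  forms  agree  up  to  floating-point  rounding,  which  is  the  sense  in  which  a  measure  can  be  written  either  way. 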

The  definitions  for  the  symbols  used  in  the  table  and  the  conversion 
formulae  for  going  from  probabilities  to  frequency  counts  and 
back  again  are  found  in  Sec.  3.1.  It  was  necessary  to  add  superscripts 
to  the  frequency  counts  in  the  table  to  distinguish  between  some 
additional  counts  which  appear  in  these  measures.  Thus  is  the 

number  of  partitions  in  which  the  subset  of  interest  contains  document 
A  new  type  of  information  retrieval  system  is  suggested  which  utilizes  data 
of  the  type  generated  by  users  of  the  system  instead  of  data  generated  by  indexers. 
The  theoretical  model  on  which  the  system  is  based  consists  of  three  basic  elements. 
The  first  element  is  a  measure  of  the  relatedness  between  document-pairs.  It  is 
derived  from  information  theory.  The  second  element  is  a  definition  of  what 
constitutes  a  set  (cluster)  of  inter-related  documents.  This  definition  is  based  on 
the  measure  of  relatedness.  The  last  element  is  a  procedure  which  transforms  a 
request  for  information  into  a  cluster  of  answer  documents.  An  experimental  system 
was  developed  to  test  the  model  in  a  realistic  environment.  It  was  programmed  for 
the  Project  MAC  time-sharing  system  and  utilized  the  physics  data  file  of  the 
Technical  Information  Project.  Citations  were  used  as  the  data  base  for  the  measure 
of  relatedness.  A  file  structure  and  retrieval  language  were  designed  which  allowed 
close  man-machine  coupling.  Retrieval  efficiency  compared  to  known  sets  was  <0  * 
TO  percent,  and  ways  of  improving  this  further  are  suggested. 


KEY  WORDS 

Computers 
Document  searching 
Information  retrieval 
Machine-aided  cognition 
Multiple-access  computers 
On-line  computer  systems 
Real-time  computer  systems 
Time-sharing 
Time-shared  computer  systems 